Preface


In recent years, statistical machine learning (ML) has become very successful; it has
triggered a renaissance of artificial intelligence (AI) and has improved enormously in
predictivity. Sophisticated models have steadily increased in complexity, which has often 
happened at the expense of human interpretability (correlation vs. causality). Conse- 
quently, an active field of research called explainable AI (xAI) has emerged with the 
goal of creating tools and models that are both predictive and interpretable and under- 
standable for humans. The growing xAI community has already achieved important 
advances, such as robust heatmap-based explanations of DNN classifiers. From appli- 
cations in digital transformation (e.g., agriculture, climate, forest operations, medical 
applications, cyber-physical systems, automation tools and robotics, sustainable living, 
sustainable cities, etc.), there is now a need to massively engage in new scenarios, such 
as explaining unsupervised and reinforcement learning and creating explanations that are
optimally structured for human decision makers. While explainable AI fundamentally 
deals with the implementation of transparency and traceability of statistical black-box 
ML methods, there is an urgent need to go beyond explainable AI, e.g., to extend explain- 
able AI with causability, to measure the quality of explanations, and to find solutions 
for building efficient human-AI interfaces for these novel interactions between artificial
intelligence and human intelligence. For certain tasks, interactive machine learning
with the human-in-the-loop can be advantageous because a human domain expert can 
sometimes complement the AI with implicit knowledge. Such a human-in-the-loop can 
sometimes — not always of course — contribute to an artificial intelligence with experi- 
ence, conceptual understanding, context awareness and causal reasoning. Formalized, 
this human knowledge can be used to create structural causal models of human decision 
making, and features can be traced back to train AI — and thus contribute to making AI 
even more successful — beyond the current state-of-the-art. The field of explainable AI 
has received exponential interest in the international machine learning and AI research 
community. Awareness of the need to explain ML models has grown in similar propor- 
tions in industry, academia and government. With the substantial explainable AI research 
community that has been formed, there is now a great opportunity to make this push 
towards successful explainable AI applications. 

With this volume of Springer Lecture Notes in Artificial Intelligence (LNAI), we 
will help the international research community to accelerate this process, promote a 
more systematic use of explainable AI to improve models in diverse applications, and 
ultimately help to better understand how current explainable AI methods need to be 
improved and what kind of theory of explainable AI is needed. 
The contributions in this volume were very carefully selected by the editors together
with help from the Scientific Committee and each paper was reviewed by three 
international experts in the field. 
Abstract. The success of statistical machine learning from big data,
especially of deep learning, has made artificial intelligence (AI) very 
popular. Unfortunately, especially with the most successful methods, the 
results are very difficult to comprehend by human experts. The appli- 
cation of AI in areas that impact human life (e.g., agriculture, climate, 
forestry, health, etc.) has therefore led to a demand for trust, which
can be fostered if the methods can be interpreted and thus explained 
to humans. The research field of explainable artificial intelligence (XAI) 
provides the necessary foundations and methods. Historically, XAI has 
focused on the development of methods to explain the decisions and 
internal mechanisms of complex AI systems, with much initial research 
concentrating on explaining how convolutional neural networks produce 
image classification predictions by producing visualizations which high- 
light what input patterns are most influential in activating hidden units, 
or are most responsible for a model’s decision. In this volume, we sum- 
marize research that outlines and takes next steps towards a broader 
vision for explainable AI in moving beyond explaining classifiers via such 
methods, to include explaining other kinds of models (e.g., unsupervised 
and reinforcement learning models) via a diverse array of XAI techniques 
(e.g., question-and-answering systems, structured explanations). In addi- 
tion, we also intend to move beyond simply providing model explanations 
to directly improving the transparency, efficiency and generalization abil- 
ity of models. We hope this volume presents not only exciting research 
developments in explainable AI but also a guide for what next areas to 
focus on within this fascinating and highly relevant research field as we 
enter the second decade of the deep learning revolution. This volume is an
outcome of the ICML 2020 workshop on “XXAI: Extending Explainable
AI Beyond Deep Models and Classifiers.”

© The Author(s) 2022


Keywords: Artificial intelligence · Explainable AI · Machine
learning · Explainability


1 Introduction and Motivation for Explainable AI 


In the past decade, deep learning has reinvigorated machine learning
research by demonstrating its power in learning from vast amounts of data to
solve complex tasks, often even beyond human-level performance [24], which
has made AI extremely popular [5]. However, its power is also its peril: deep
learning models are composed of millions of parameters; their high complexity
[17] makes such “black-box” models challenging for humans to understand
[20]. As such “black-box” approaches are increasingly applied to high-impact,
high-risk domains, such as medical AI or autonomous driving, the impact of their
failures also increases (e.g., medical misdiagnoses, vehicle crashes, etc.).
Consequently, there is an increasing demand for a diverse toolbox of meth- 
ods that help AI researchers and practitioners design and understand complex 
AI models. Such tools could provide explanations for model decisions, suggest 
corrections for failures, and ensure that protected features, such as race and gen- 
der, are not misinforming or biasing model decisions. The field of explainable 
AI (XAI) [32] focuses on the development of such tools and is crucial to the 
safe, responsible, ethical and accountable deployment of AI technology in our 
wider world. Given the increasing application of AI in practically all domains
that affect human life (e.g., agriculture, climate, forestry, health, sustainable
living, etc.), there is also a need to address new scenarios in the future, e.g.,
explaining unsupervised and reinforcement learning and creating explanations that
are optimally structured for human decision makers with respect to their individual
prior knowledge. While explainable AI is essentially concerned with
implementing transparency and traceability of black-box statistical ML methods,
there is an urgent need in the future to go beyond explainable AI, e.g., to
extend explainable AI to include causality and to measure the quality of expla- 
nations [12]. A good example is the medical domain where there is a need to ask 
“what-if” questions (counterfactuals) to gain insight into the underlying inde- 
pendent explanatory factors of a result [14]. In such domains, and for certain 
tasks, a human-in-the-loop can be beneficial, because such a human expert can 
sometimes augment the AI with tacit knowledge, i.e., contribute to an AI with
human experience, conceptual understanding, context awareness, and causal rea- 
soning. Humans are very good at multi-modal thinking and can integrate new 
insights into their conceptual knowledge space shaped by experience. Humans 
are also robust, can generalize from a few examples, and are able to understand 
context from even a small amount of data. Formalized, this human knowledge 
can be used to build structural causal models of human decision making, and 
the features can be traced back to train AI, helping to make current AI even
more successful beyond the current state of the art. 

In such sensitive and safety-critical application domains, there will be an 
increasing need for trustworthy AI solutions in the future [13]. Trusted AI 
requires both robustness and explainability and should be balanced with human 
values, ethical principles [25], and legal requirements [36], to ensure privacy, secu- 
rity, and safety for each individual person. The international XAI community is 
making great contributions to this end. 


2 Explainable AI: Past and Present 


In tandem with impressive advances in AI research, there have been numerous 
methods introduced in the past decade that aim to explain the decisions and 
inner workings of deep neural networks. Many such methods can be described 
along the following two axes: (1) whether an XAI method produces local or global 
explanations, that is, whether its explanations explain individual model deci- 
sions or instead characterize whole components of a model (e.g., a neuron, layer, 
entire network); and (2) whether an XAI method is post-hoc or ante-hoc, that is, 
whether it explains a deep neural network after it has been trained using standard 
training procedures or it introduces a novel network architecture that produces 
an explanation as part of its decision. For a brief overview of XAI methods, please
refer to [15]. Of the research that focuses on explaining specific predictions, the 
most active area of research has been on the problem of feature attribution [31], 
which aims to identify what parts of an input are responsible for a model’s out- 
put decision. For computer vision models such as object classification networks, 
such work typically produces heatmaps that highlight which regions of an input
image most influence a model’s prediction [3, 8, 28, 33-35, 38, 41].

Similarly, feature visualization methods have been the most popular research 
stream within explainable techniques that provide global explanations. Such 
techniques typically explain hidden units or activation tensors by showing either 
real or generated images that most activate the given unit [4, 27, 35, 38, 40] or set
of units [10, 18, 42] or are most similar to the given tensor [21].

In the past decade, most explainable AI research has focused on the develop- 
ment of post-hoc explanatory methods like feature attribution and visualization. 

That said, more recently, there have been several methods that introduce 
novel, interpretable-by-design models that were intentionally designed to pro- 
duce an explanation, for example as a decision tree [26], via graph neural net- 
works [29], by comparing to prototypical examples [7], by constraining neurons 
to correspond to interpretable attributes [19,22], or by summing up evidence 
from multiple image patches [6]. 

As researchers have continued to develop explainable AI methods, some work
has also focused on the development of disciplined evaluation benchmarks for
explainable AI and has highlighted some shortcomings of popular methods and
the need for such metrics [1-3, 9, 11, 16, 23, 28, 30, 37, 39].

In tandem with the increased research in explainable AI, there have been a
number of research outputs [32] and gatherings (e.g., tutorials, workshops, and
conferences) that have focused on this research area, including the following:


— NeurIPS workshop on “Interpreting, Explaining and Visualizing Deep Learn- 
ing — Now what?” (2017) 

— ICLR workshop on “Debugging Machine Learning Models” (2019) 

— ICCV workshop on “Interpreting and Explaining Visual AI
Models” (2019)

— CVPR tutorial on “Interpretable Machine Learning for Computer Vision” 
(2018-ongoing) 

— ACM Conference on Fairness, Accountability, and Transparency (FAccT) 
(2018-ongoing) 

— CD-MAKE conference with Workshop on xAI (2017-ongoing) 


Through these community discussions, some have recognized that there were 
still many under-explored yet important areas within explainable AI. 


Beyond Explainability. To that end, we organized the ICML 2020 workshop 
“XXAI: Extending Explainable AI Beyond Deep Models and Classifiers,” which 
focused on the following topics: 


1. Explaining beyond neural network classifiers and explaining other kinds of 
models such as random forests and models trained via unsupervised or rein- 
forcement learning. 

2. Explaining beyond heatmaps and using other forms of explanation such as 
structured explanations, question-and-answer and/or dialog systems, and 
human-in-the-loop paradigms. 

3. Explaining beyond explaining and developing other research to improve the 
transparency of AI models, such as model development and model verification 
techniques. 


This workshop fostered many productive discussions, and this book is a follow- 
up to our gathering and contains some of the work presented at the workshop 
along with a few other relevant chapters. 


3 Book Structure 


We organized this book into three parts: 


1. Part 1: Current Methods and Challenges 
2. Part 2: New Developments in Explainable AI 
3. Part 3: An Interdisciplinary Approach to Explainable AI 


Part 1 gives an overview of the current state of the art of XAI methods as well as
their pitfalls and challenges. In Chapter 1, Holzinger, Samek and colleagues give
a general overview of popular XAI methods. In Chapter 2, Bhatt et al. point out
that current explanation techniques are mainly used by the internal stakeholders
who develop the learning models, not by the external end-users who actually
receive the service. They present takeaway messages from an interview
study on how to deploy XAI in practice. In Chapter 3, Molnar et al. describe
the general pitfalls a practitioner can encounter when employing model-agnostic
interpretation methods. They point out that pitfalls arise when there are
issues with model generalization, interactions between features, etc., and call
for a more cautious application of explanation methods. In Chapter 4, Salewski
et al. introduce a new dataset that can be used for generating natural language
explanations for visual reasoning tasks.

In Part 2, several novel XAI approaches are presented. In Chapter 5, Kolek
et al. propose a novel rate-distortion framework that combines mathematical
rigor with maximal flexibility when explaining decisions of black-box models.
In Chapter 6, Montavon et al. present an interesting approach, dubbed
neuralization-propagation (NEON), to explain unsupervised learning models, for
which directly applying supervised explanation techniques is not straightforward.
In Chapter 7, Karimi et al. consider a causal effect in the algorithmic
recourse problem and present a framework that uses structural causal models
and a novel optimization formulation. The next three chapters in Part 2 mainly
focus on XAI methods for problems beyond simple classification. In Chapter 8,
Zhou gives a brief summary of recent work on interpreting deep generative
models, like Generative Adversarial Networks (GANs), and shows how
human-understandable concepts can be identified and utilized for interactive image
generation. In Chapter 9, Dinu et al. apply explanation methods to reinforcement
learning and use the recently developed RUDDER framework in order to extract
meaningful strategies that an agent has learned via reward redistribution. In
Chapter 10, Bastani et al. also focus on interpretable reinforcement learning and
describe recent progress on programmatic policies that are easily verifiable
and robust. The next three chapters focus on using XAI beyond simple explanation
of a model’s decision, e.g., pruning or improving models with the aid of
explanation techniques. In Chapter 11, Singh et al. present the PDR framework
that considers three aspects: devising a new XAI method, improving a given 
model with the XAI methods, and verifying the developed methods with real- 
world problems. In Chapter 12, Bargal et al. describe the recent approaches that 
utilize spatial and spatiotemporal visual explainability to train models that gen- 
eralize better and possess more desirable characteristics. In Chapter 13, Becking 
et al. show how explanation techniques like Layer-wise Relevance Propagation 
[3] can be leveraged with information theory concepts and can lead to a bet- 
ter network quantization strategy. The next two chapters then exemplify how 
XAI methods can be applied to various kinds of science problems and extract 
new findings. In Chapter 14, Marcos et al. apply explanation methods to marine 
science and show how a landmark-based approach can generate heatmaps to 
monitor migration of whales in the ocean. In Chapter 15, Mamalakis et al. sur- 
vey interesting recent results that applied explanation techniques to meteorology 
and climate science, e.g., weather prediction. 

Part 3 presents more interdisciplinary applications of XAI methods beyond
technical domains. In Chapter 16, Hacker and Passoth provide an overview
of legal obligations to explain AI and evaluate current policy proposals.
In Chapter 17, Zhou et al. provide a state-of-the-art overview of the relations
between explanation and AI fairness and especially the role of explanation in
human fairness judgement. Finally, in Chapter 18, Tsai and Carroll review
logical approaches to explainable AI (XAI) and problems/challenges raised for
explaining AI using genetic algorithms. They argue that XAI is more than a
matter of accurate and complete explanation, and that it requires pragmatics of
explanation to address the issues it seeks to address.

Most of the chapters fall under Part 2, and we are excited by the variety 
of XAI research presented in this volume. While by no means an exhaustive 
collection, we hope this book presents both quality research and vision for the 
current challenges, next steps, and future promise of explainable AI research. 
Abstract. Explainable Artificial Intelligence (xAI) is an established
field with a vibrant community that has developed a variety of very 
successful approaches to explain and interpret predictions of complex 
machine learning models such as deep neural networks. In this article, we 
briefly introduce a few selected methods and discuss them in a short, clear 
and concise way. The goal of this article is to give beginners, especially 
application engineers and data scientists, a quick overview of the state 
of the art in this current topic. The following 17 methods are covered 
in this chapter: LIME, Anchors, GraphLIME, LRP, DTD, PDA, TCAV, 
XGNN, SHAP, ASV, Break-Down, Shapley Flow, Textual Explanations 
of Visual Models, Integrated Gradients, Causal Models, Meaningful Per- 
turbations, and X-NeSyL. 


Keywords: Explainable AI · Methods · Evaluation


1 Introduction 


Artificial intelligence (AI) has a long tradition in computer science. Machine 
learning (ML) and particularly the success of “deep learning” in the last decade 
made AI extremely popular again [15, 25, 90].

The great success came with additional costs and responsibilities: the most 
successful methods are so complex that it is difficult for a human to re-trace, 
to understand, and to interpret how a certain result was achieved. Consequently,
explainability/interpretability/understandability is motivated by the
lack of transparency of these black-box approaches, which do not foster trust
and acceptance of AI in general and ML in particular. Increasing legal and
data protection aspects, e.g., due to the new European General Data Protection
Regulation (GDPR, in force since May 2018), complicate the use of black-box
approaches, particularly in domains that affect human life, such as the medical
field [56, 63, 73, 76].

The term explainable AI (xAI) was coined by DARPA [28] and has since gained
a lot of popularity. However, xAI is not a new buzzword. It can be seen
as a new name for a very old quest in science to help to provide answers to
questions of why [66]. The goal is to enable human experts to understand the
underlying explanatory factors of why an AI decision has been made [64]. This
is highly relevant for causal understanding and thus for enabling ethically responsible
AI and transparent, verifiable machine learning in decision support [74].

The international community has developed a very broad range of different
methods and approaches, and here we provide a short, concise overview to help
engineers as well as students select the best possible method. Figure 1 shows
the most popular XAI toolboxes.
Fig. 1. Number of stars on GitHub for the most popular repositories presented in this
paper. While these repositories focus on the explanation task, the new Quantus toolbox 
[30] offers a collection of methods for evaluating and comparing explanations. 


In the following we provide a short overview of some of the most popular
methods for explaining complex models. We hope that this list will help
both practitioners in choosing the right method for model explanation and
XAI method developers in noting the shortcomings of currently available methods.
Figure 2 gives an overview of the chronology of development of successive
explanatory methods. Methods such as LRP and LIME were among the first¹
generic techniques to explain decisions of complex ML models. In addition to
the overview of explanation techniques, we would also like to point the interested
reader to work that developed methods and offered datasets to objectively
evaluate and systematically compare explanations. Worth mentioning here are Quantus²
[30], a new toolbox offering an exhaustive collection of evaluation methods and
metrics for explanations, and CLEVR-XAI³ [8], a benchmark dataset for the
ground truth evaluation of neural network explanations.
Fig. 2. Chronology of the development of successive explanatory methods described in
this paper. Initially, the methods were focused on model analysis based on the model 
itself or on sample data. Subsequent methods used more and more information about 
the structure and relationships between the analysed variables. 


2 Explainable AI Methods - Overview 


2.1 LIME (Local Interpretable Model-agnostic Explanations)


Idea: By treating machine learning models as black-box functions, model-agnostic
explanation methods typically only have access to the model’s output.
The fact that these methods do not require any information about the model’s
internals, e.g., in the case of neural networks the topology, learned parameters
(weights, biases) and activation values, makes them widely applicable and very
flexible.


1 We are aware that gradient-based sensitivity analysis and occlusion-based techniques
have been proposed even earlier [11, 62, 75, 89]. However, these techniques have various
disadvantages (see [61, 70]) and are therefore not considered in this paper.

2 https://github.com/understandable-machine-intelligence-lab/quantus

3 https://github.com/ahmedmagdiosman/clevr-xai
One prominent representative of this class of explanation techniques is the
Local Interpretable Model-agnostic Explanations (LIME) method [67]. The main
idea of LIME is to explain a prediction of a complex model $f_m$, e.g., a deep neural
network, by fitting a local surrogate model $f_s$, whose predictions are easy to
explain. Therefore, LIME is also often referred to as a surrogate-based explanation
technique [70]. Technically, LIME generates samples in the neighborhood $\mathcal{N}_{x_i}$ of
the input of interest $x_i$, evaluates them using the target model, and subsequently
approximates the target model in this local vicinity by a simple linear function,
i.e., a surrogate model which is easy to interpret. Thus, LIME does not directly
explain the prediction of the target model $f_m(x_i)$, but rather the predictions
of a surrogate model $f_s(x_i)$, which locally approximates the target model (i.e.,
$f_m(x) \approx f_s(x)$ for $x \in \mathcal{N}_{x_i}$).

GitHub Repo: https://github.com/marcotcr
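
The sampling-and-fitting procedure described above can be illustrated with a minimal sketch. It is not the implementation from the LIME repository: the Gaussian perturbations, the exponential proximity kernel, the parameter names (`sigma`, `kernel_width`) and the toy `black_box` target model are all assumptions chosen for illustration, and only NumPy is used.

```python
import numpy as np

def lime_explain(predict_fn, x, num_samples=500, sigma=0.5, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around x; return (coefficients, intercept)."""
    rng = np.random.default_rng(seed)
    # Sample points in the neighbourhood of x (Gaussian perturbations: an assumption).
    samples = x + sigma * rng.standard_normal((num_samples, x.size))
    # The target model is queried only through its outputs (black-box access).
    targets = predict_fn(samples)
    # Weight each sample by its proximity to x (exponential kernel: an assumption).
    dists = np.linalg.norm(samples - x, axis=1)
    weights = np.exp(-(dists ** 2) / kernel_width ** 2)
    # Weighted least squares for the linear surrogate f_s(z) = w . z + b.
    design = np.hstack([samples, np.ones((num_samples, 1))])
    sqrt_w = np.sqrt(weights)
    solution, *_ = np.linalg.lstsq(design * sqrt_w[:, None], targets * sqrt_w, rcond=None)
    return solution[:-1], solution[-1]

def black_box(z):
    # Hypothetical target model: nonlinear in feature 0, linear in feature 1, ignores feature 2.
    return np.sin(z[:, 0]) + 2.0 * z[:, 1]

x_i = np.array([0.0, 1.0, -1.0])
coefficients, intercept = lime_explain(black_box, x_i)
print("local feature weights:", coefficients)
```

The surrogate coefficients serve as the explanation: each one estimates how strongly the corresponding feature drives the target model's output near the input of interest. Changing `seed` changes the samples and hence, slightly, the explanation.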


Discussion: LIME has by now been applied successfully in many
different application domains, which demonstrates the popularity of this
model-agnostic method. One limitation is that LIME only indirectly solves
the explanation problem by relying on a surrogate model. Thus, the quality of
the explanation largely depends on the quality of the surrogate fit, which itself
may require dense sampling and thus may result in large computational costs.
Furthermore, sampling always introduces uncertainty, which can lead to
non-deterministic behaviours and result in variable explanations for the same input
sample.


2.2 Anchors 


Idea: The basic idea is that individual predictions of any black-box classifica- 
tion model are explained by finding a decision rule that sufficiently “anchors” 
the prediction - hence the name “anchors” [68]. The resulting explanations are 
decision rules in the form of IF-THEN statements, which define regions in the 
feature space. In these regions, the predictions are fixed (or “anchored”) to the 
class of the data point to be explained. Consequently, the classification remains 
the same no matter how much the other feature values of the data point that 
are not part of the anchor are changed. 

Good anchors should have high precision and high coverage. Precision is the 
proportion of data points in the region defined by the anchor that have the same 
class as the data point being explained. Coverage describes how many data points 
an anchor’s decision rule applies to. The more data points an anchor covers, the 
better, because the anchor then covers a larger area of the feature space and thus 
represents a more general rule. Anchors is a model-agnostic explanation method, 
i.e., it can be applied to any prediction model without requiring knowledge about 
the internals. Search and construction of decision rules is done by reinforcement
learning (RL) [32, 81] in combination with a modified beam search, a heuristic
search algorithm that extends the most promising nodes in a graph. The anchors
algorithm cycles through different steps: produce candidate anchors, select the
best candidates, then use beam search to extend the anchor rules. To select the 
best candidate, it is necessary to call the model many times, which can be seen 
as an exploration or multi-armed bandit problem. 


GitHub Repo: https://github.com/marcotcr/anchor 
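
To make the two quality criteria concrete, the following sketch computes precision and coverage for a candidate anchor on a small discrete dataset. It is a simplification and not the algorithm from the anchor repository: the real method estimates precision from perturbed samples and searches over candidate rules, whereas here a fixed dataset is evaluated and the rule is simply "the anchored features take the same values as in the explained instance". The function name and the toy model are hypothetical.

```python
import itertools
import numpy as np

def anchor_precision_coverage(predict_fn, data, x, anchor_features):
    """Estimate (precision, coverage) of the rule 'anchored features equal the values in x'."""
    anchor_features = list(anchor_features)
    # Points the rule applies to: they agree with x on every anchored feature.
    in_region = np.all(data[:, anchor_features] == x[anchor_features], axis=1)
    coverage = float(in_region.mean())
    if not in_region.any():
        return 0.0, 0.0
    # Precision: share of covered points that receive the same class as x.
    target_class = predict_fn(x[None, :])[0]
    precision = float(np.mean(predict_fn(data[in_region]) == target_class))
    return precision, coverage

def toy_model(z):
    # Hypothetical classifier: class 1 whenever feature 0 is 1, otherwise the value of feature 1.
    return np.where(z[:, 0] == 1, 1, z[:, 1])

data = np.array(list(itertools.product([0, 1], repeat=3)))  # all 8 binary instances
x = np.array([1, 0, 1])

for features in ([0], [2]):
    precision, coverage = anchor_precision_coverage(toy_model, data, x, features)
    print(f"anchor on features {features}: precision={precision:.2f}, coverage={coverage:.2f}")
```

In this toy setting, anchoring feature 0 fixes the prediction completely, while anchoring feature 2 covers the same share of the data but leaves the prediction free to change, so it is the weaker anchor.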


Discussion: Anchors is model-agnostic and can be applied to different
domains such as tabular data, images and text, depending on the perturbation
strategy. However, the current Python implementation of anchors only supports
tabular and text data. Compared to LIME, the scope of interpretation is clearer
as the anchors specify the boundaries within which they should be interpreted.
The coverage of an anchor decision rule can be used as a measure of the model 
fidelity of the anchor. Furthermore, the decision rules are easy to understand, but 
there are many hyper-parameters in the calculation of anchors, such as the width 
of the beam and the precision threshold, which need to be tuned individually. 
The perturbation strategies also need to be carefully selected depending on the 
application and model. The calculation of anchors requires many calls to the 
prediction function, which makes the anchors computationally intensive. Data 
instances that are close to the decision boundary of the model may require more 
complex rules with more features and less coverage. Unbalanced classification 
problems can produce trivial decision rules, such as classifying each data point 
as the majority class. A possible remedy is to adapt the perturbation strategy 
to a more balanced distribution. 


2.3 GraphLIME 


Idea: GraphLIME [38] is a method that takes the basic idea of LIME (see
Sect. 2.1) but is not linear. It is applied to a special type of neural network
architecture, namely graph neural networks (GNN). These models can process 
non-Euclidean data as they are organised in a graph structure [9]. The main 
tasks that GNNs perform are node classification, link prediction and graph clas- 
sification. Like LIME, this method tries to find an interpretable model, which 
in this case is the Hilbert-Schmidt Independence Criterion (HSIC) Lasso model, 
for explaining a particular node in the input graph. It takes into account the 
fact that during the training of the GNN, several nonlinear aggregation and 
combination methods use the features of neighbouring nodes to determine the 
representative embedding of each node. This embedding is used to distinguish 
nodes into different classes in the case of node classification and to collectively 
distinguish graphs in graph classification tasks. 

Since for this type of model a linear explanation such as that of LIME would return
unfaithful results, the main idea of GraphLIME is to sample from the N-hop
neighbourhood of the node and collect features w.r.t. the node prediction.
Those are used to train the HSIC Lasso model, which is a kernel method (and
thereby interpretable) that can compute which node features the output
prediction depends on. This is similar to the perturbation method that LIME
uses, while the comparison, in this case, is based on HSIC estimation between
the random variables representing the features and the prediction distributions.
This method learns correlations between the features of the neighbours, which
also underlines its explanation capabilities.


GitHub Repo: https://github.com/WilliamCCHuang/GraphLIME 
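
The dependence measure at the core of this approach can be sketched as follows. The code computes the standard biased empirical HSIC estimate with Gaussian kernels and uses it to score how strongly each feature of a set of sampled nodes relates to the model's prediction. This is only the scoring ingredient, not HSIC Lasso itself (which additionally solves a sparse regression over the kernel matrices) and not the code from the GraphLIME repository; the kernel choice, the bandwidth and the synthetic data are assumptions.

```python
import numpy as np

def gaussian_gram(v, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel over the rows of v."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    sq_dists = np.sum((v[:, None, :] - v[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def hsic(u, v, bandwidth=1.0):
    """Biased empirical HSIC estimate trace(K H L H) / n^2."""
    n = len(u)
    K = gaussian_gram(u, bandwidth)
    L = gaussian_gram(v, bandwidth)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / n ** 2

# Synthetic stand-in for the features of nodes sampled from an N-hop neighbourhood.
rng = np.random.default_rng(0)
features = rng.standard_normal((60, 3))
# Hypothetical model output that depends (nonlinearly) on feature 0 only.
prediction = np.tanh(2.0 * features[:, 0])

scores = [hsic(features[:, d], prediction) for d in range(features.shape[1])]
print("HSIC score per feature:", scores)
```

A larger score indicates a stronger (possibly nonlinear) statistical dependence between a feature and the prediction, which is what allows a kernel-based explainer to go beyond the linear surrogate used by LIME.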


Discussion: The developers compared GraphLIME with one of the first xAI 
methods for GNNs at the time, namely GNNExplainer, w.r.t. three criteria: 
(1) ability to detect useless features, (2) ability to decide whether the pre- 
diction is trustworthy, and (3) ability to identify the better model among two 
GNN classifiers. They show that for synthetic data and human-labelled anno- 
tations, GraphLIME exceeds the GNNExplainer by far in the last two crite- 
ria. They arrive at a very interesting insight, namely that models that have 
fewer untrustworthy features in their explanation have better classification per- 
formance. Furthermore, GraphLIME is shown to be computationally much more 
efficient than GNNExplainer. It would be beneficial, and is considered future
work, if GraphLIME also tried to find important graph substructures
instead of just features, if it were compared with other methods like PGExplainer
[52], PGMExplainer [83], GNN-LRP [72], and if it were extended to multiple
instance explanations. Finally, it is important to note that GraphLIME has been
successfully used for the investigation of backdoor attacks on GNNs by uncovering
the relevant features of the graph’s nodes [86].


2.4 LRP (Layer-wise Relevance Propagation)


Idea: Layer-wise Relevance Propagation (LRP) [10] is a propagation-based 
explanation method, i.e., it requires access to the model’s internals (topology, 
weights, activations etc.). This additional information about the model, however, 
allows LRP to simplify and thus more efficiently solve the explanation problem. 
More precisely, LRP does not explain the prediction of a deep neural network in 
one step (as model agnostic methods would do), but exploits the network struc- 
ture and redistributes the explanatory factors (called relevance R) layer by layer, 
starting from the model’s output, onto the input variables (e.g., pixels). Each 
redistribution can be seen as the solution of a simple (because only between two 
adjacent layers) explanation problem (see interpretation of LRP as Deep Taylor 
Decomposition in Sect. 2.5). 

Thus, the main idea of LRP is to explain by decomposition, i.e., to iteratively 
redistribute the total evidence of the prediction f(x), e.g., indicating that there 
is a cat in the image, in a conservative manner from the upper to the next lower 
layer, i.e., 


\[
\sum_i R_i^{(0)} = \ldots = \sum_j R_j^{(l)} = \sum_k R_k^{(l+1)} = \ldots = f(x). \tag{1}
\]

Note that $R_i^{(0)}$ denotes the relevance assigned to the $i$th input element (e.g.,
pixel), while $R_j^{(l)}$ stands for the relevance assigned to the $j$th neuron at the $l$th layer.
This conservative redistribution not only ensures that no relevance is added or 
lost on the way (analogous to energy conservation principle or Kirchhoff’s law 
in physics), but also allows for signed explanations, where positive relevance 
values hint at relevant information supporting the prediction and negative rele- 
vance values indicate evidence speaking against it. Different redistribution rules, 
adapted to the specific properties of particular neural network layers, have been 
proposed for LRP [42,58]. In contrast to other XAI techniques which are purely 
based on heuristics, the LRP rules have a clear theoretical foundation, namely 
they result from the Deep Taylor Decomposition (DTD) [60] of the relevance 
function with a particular choice of root point (see Sect. 2.5). 
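
The conservation property of Eq. (1) can be checked numerically with a minimal sketch. It assumes a tiny bias-free ReLU network with hand-picked weights and the basic redistribution rule $R_i = \sum_j \frac{a_i w_{ij}}{z_j} R_j$; the helper `lrp_dense` and all values are hypothetical and are not taken from Zennit or iNNvestigate.

```python
import numpy as np

def lrp_dense(a, W, R_out, eps=1e-9):
    """Redistribute relevance R_out of a bias-free dense layer onto its inputs a.

    Basic rule R_i = sum_j a_i * w_ij / z_j * R_j, with a tiny stabiliser
    added to z_j to avoid division by zero.
    """
    z = a @ W                                   # pre-activations z_j
    z = z + eps * np.where(z >= 0, 1.0, -1.0)   # stabiliser
    s = R_out / z
    return a * (W @ s)

# Hand-picked toy network (assumption): 3 inputs, 3 hidden ReLU units, 1 output.
x = np.array([1.0, 2.0, -1.0])
W1 = np.array([[1.0, -1.0, -1.0],
               [1.0,  1.0,  0.0],
               [2.0,  0.5,  1.0]])
W2 = np.array([[2.0], [-1.0], [3.0]])

a1 = np.maximum(0.0, x @ W1)   # hidden activations
f_x = a1 @ W2                  # network output f(x)

R1 = lrp_dense(a1, W2, f_x)    # relevance of hidden neurons
R0 = lrp_dense(x, W1, R1)      # relevance of input variables
```

Here `R0` contains both positive and negative entries (evidence for and against the prediction), while `R0.sum()` and `R1.sum()` both equal `f_x` up to the stabiliser.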

While LRP was originally developed for convolutional neural networks
and bag-of-words type models, various extensions have been proposed, making
it a widely applicable XAI technique today. For instance, Arras et al. [6,7]
developed meaningful LRP redistribution rules for LSTM models. LRP variants
for GNN and Transformer models have also been proposed recently [3,72]. Finally,
through the “neuralization trick”, i.e., by converting a non-neural network model
into a neural network, various other classical ML algorithms have been made
explainable with LRP, including k-means clustering [39], one-class SVM [40] as
well as kernel density estimation [59]. Furthermore, meta analysis methods such
as spectral relevance analysis (SpRAy) [47] have been proposed to cluster and
systematically analyze sets of explanations computed with LRP (SpRAy is not
restricted to LRP explanations though). These analyses have been shown to be
useful for detecting artifacts in the dataset and uncovering so-called “Clever Hans”
behaviours of the model [47].

The recently published Zennit toolbox [4] implements LRP (and other
methods) in Python, while the CoRelAy toolbox (see footnote 4) offers a collection
of meta analysis methods. Furthermore, the GitHub library iNNvestigate provides
a common interface and out-of-the-box implementation for many analysis methods,
including LRP [2].


GitHub Repo: https://github.com/chr5tphr/zennit 
https://github.com/albermax/innvestigate 


Discussion: LRP is a very popular explanation method, which has been applied
in a broad range of domains, e.g., computer vision [46], natural language
processing [7], EEG analysis [78], and meteorology [54].

The main advantages of LRP are its high computational efficiency (in the
order of one backward pass), its theoretical underpinning making it a trustworthy
and robust explanation method (see systematic comparison of different XAI
methods [8]), and its long tradition and high popularity (it is one of the first
XAI techniques, different highly efficient implementations are available, and it
has been successfully applied to various problems and domains). The price to pay
for these advantages is restricted flexibility, i.e., a careful adaptation of the
redistribution rules used may be required for novel model architectures. For many
popular layer types, recommended redistribution rules are described in [42,58].

4 https://github.com/virelay/corelay.

Finally, various works showed that LRP explanations can be used beyond 
sheer visualization purposes. For instance, [47] used them to semi-automatically 
discover artefacts in large image corpora, while [5,79] went one step further and 
demonstrated that they can be directly (by augmenting the loss) or indirectly (by 
adapting training data) used to improve the model. Another line of work [14,87] 
exploits the fact that LRP computes relevance values not only for the input 
variables, but for all elements of the neural network, including weights, biases 
and individual neurons, to optimally prune and quantize the neural model. The
idea is simple: since LRP explanations tell us which parts of the neural network
are relevant, we can simply remove the irrelevant elements and thus improve the
coding efficiency and speed up the computation.


2.5 Deep Taylor Decomposition (DTD) 


Idea: The Deep Taylor Decomposition (DTD) method [60] is a propagation- 
based explanation technique, which explains decisions of a neural network by 
decomposition. It redistributes the function value (i.e., the output of the neural 
network) to the input variables in a layer-by-layer fashion, while utilizing the 
mathematical tool of (first-order) Taylor expansion to determine the proportion 
or relevance assigned to the lower layer elements in the redistribution process 
(i.e., their respective contributions). This approach is closely connected to the 
LRP method (see Sect. 2.4). Since most LRP rules can be interpreted as a Taylor 
decomposition of the relevance function with a specific choice of root point, DTD 
can be seen as the mathematical framework of LRP. 

DTD models the relevance of a neuron $k$ at layer $l$ as a simple relevance
function of the lower-layer activations, i.e.,

\[
R_k(a) = \max\Big(0, \sum_i a_i w_{ik}\Big)\, c_k, \tag{2}
\]

where $a = [a_1, \ldots, a_d]$ are the activations at layer $l-1$, $w_{ik}$ are the weights
connecting neurons $i$ (at layer $l-1$) and $k$ (at layer $l$), and $c_k$ is a constant. This
model is certainly valid at the output layer (as $R_k$ is initialized with the network
output $f(x)$). Through an inductive argument the authors of [60] proved that
this model also holds (approximately) at intermediate layers. By representing
this simple function as a Taylor expansion around a root point $\tilde{a}$, i.e.,

\[
R_k(a) = \underbrace{R_k(\tilde{a})}_{0}
       + \underbrace{\sum_i (a_i - \tilde{a}_i) \cdot [\nabla R_k(\tilde{a})]_i}_{\text{redistributed relevance}}
       + \underbrace{\varepsilon}_{0}, \tag{3}
\]

DTD tells us how to meaningfully redistribute relevance from layer $l$ to layer $l-1$.
This redistribution process is iterated until the input layer. Different choices
of root point are recommended for different types of layers (conv layer, fully
connected layer, input layer) and lead to different LRP redistribution rules [58].
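
As a worked instance of Eq. (3), consider the simplest root point $\tilde{a} = 0$. This is an illustrative assumption; [58] recommends different root points depending on the layer type. For an active neuron ($\sum_i a_i w_{ik} > 0$), with the gradient taken on the active side of the ReLU, the first-order terms give

```latex
R_{i \leftarrow k}
  = (a_i - \tilde{a}_i)\,[\nabla R_k(\tilde{a})]_i
  = a_i w_{ik} c_k ,
\qquad
\sum_i R_{i \leftarrow k} = c_k \sum_i a_i w_{ik} = R_k(a).
```

Since $c_k = R_k / \sum_{i'} a_{i'} w_{i'k}$ for an active neuron, this choice yields the basic LRP rule $R_{i \leftarrow k} = \frac{a_i w_{ik}}{\sum_{i'} a_{i'} w_{i'k}} R_k$, and the relevance of neuron $k$ is fully conserved in the redistribution.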


GitHub Repo: https://github.com/chr5tphr/zennit 
https://github.com/albermax/innvestigate 


Discussion: DTD is a theoretically motivated explanation framework, which 
redistributes relevance from layer to layer in a meaningful manner by utilizing 
the concept of Taylor expansion. The method is highly efficient in terms of com- 
putation and can be adapted to the specific properties of a model and its layers 
(e.g., by the choice of root point). As with LRP, it is usually not straightforward
to adapt DTD to novel model architectures (see e.g. local renormalization
layers [18]).


2.6 Prediction Difference Analysis (PDA) 


Idea: At the 2017 ICLR conference, Zintgraf et al. [91] presented the Prediction 
Difference Analysis (PDA) method. The method is based on the previous idea 
presented by [69] where, for a given prediction, each input feature is assigned a 
relevance value with respect to a class $c$. The idea of PDA is that the relevance of
a feature $x_i$ can be estimated by simply measuring how the prediction changes
when the feature is unknown, i.e., the difference between $p(c|x)$ and $p(c|x_{\setminus i})$,
where $x_{\setminus i}$ denotes the set of all input features except $x_i$. To evaluate the
prediction, specifically to find $p(c|x_{\setminus i})$, there are three possibilities: (1) label the
feature as unknown, (2) re-train the classifier omitting the feature, or (3) simulate
the absence of a feature by marginalizing the feature. With that, a relevance
vector $(\mathrm{WE}_i)_{i=1,\ldots,m}$ (where $m$ represents the number of features) is generated,
which is of the same size as the input and thus reflects the relative importance of
all features. A large prediction difference indicates that the feature contributed
significantly to the classification, while a small difference indicates that the
feature was not as important to the decision. Specifically, a positive value $\mathrm{WE}_i$
means that the feature contributed evidence for the class of interest, so that
removing the feature would reduce the classifier’s confidence
in the given class. A negative value, on the other hand, means that the feature
provides evidence against the class: removing the feature also removes potentially
contradictory or disturbing information and makes the classifier more
confident in the class under study.
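
Option (3), simulating the absence of a feature by marginalization, can be sketched as follows. The sketch assumes a hypothetical logistic classifier and approximates the marginalization by replacing feature $i$ with values drawn from a small background sample; it reports the plain probability difference $p(c|x) - p(c|x_{\setminus i})$, a simplification that is not the code of the DeepVis-PredDiff repository.

```python
import numpy as np

def predict_proba(X, w, b):
    """Hypothetical stand-in classifier: logistic model returning p(c | x)."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def prediction_difference(x, X_background, w, b):
    """Relevance of each feature as p(c|x) - p(c|x without feature i)."""
    p_full = predict_proba(x[None, :], w, b)[0]
    relevance = np.empty(len(x))
    for i in range(len(x)):
        # Marginalize feature i: keep x elsewhere, substitute background values.
        X_mod = np.tile(x, (len(X_background), 1))
        X_mod[:, i] = X_background[:, i]
        relevance[i] = p_full - predict_proba(X_mod, w, b).mean()
    return relevance

# Toy setup (assumption): feature 1 is ignored by the model.
w = np.array([2.0, 0.0, -1.0])
b = 0.0
x = np.array([1.0, 1.0, 1.0])
X_background = np.array([[0.0, 0.0, 0.0],
                         [0.5, 2.0, 0.5],
                         [-1.0, 1.0, 0.0]])
relevance = prediction_difference(x, X_background, w, b)
```

In this toy setup the first feature obtains a positive relevance (evidence for the class), the ignored feature obtains zero, and the third feature obtains a negative relevance (evidence against the class).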


GitHub Repo: https://github.com/lmzintgraf/DeepVis-PredDiff


Discussion: Making neural network decisions interpretable through visualiza- 
tion is important both to improve models and to accelerate the adoption of 
black-box classifiers in application areas such as medicine. In the original paper 
the authors illustrate the method in experiments on natural images (ImageNet
data), as well as medical images (MRI brain scans). A good discussion can be
found at: https://openreview.net/forum?id=BJ5UeU9xx


2.7 TCAV (Testing with Concept Activation Vectors) 


Idea: TCAV [41] is a concept-based neural network approach that aims to 
quantify how strongly a concept, such as colour, influences classification. TCAV 
is based on the idea of concept activation vectors (CAV), which describe how 
neural activations influence the presence or absence of a user-specific concept. To 
calculate such a CAV, two data sets must first be collected and combined: One 
dataset containing images representing the concept and one dataset consisting 
of images in which this concept is not present. Then a logistic regression model 
is trained on the combined dataset to classify whether the concept is present 
in an image. The activations of the user-defined layer of the neural network 
serve as features for the classification model. The coefficients of the logistic 
regression model are then the CAVs. For example, to investigate how much the
concept “striped” contributes to the classification of an image as “zebra” by
a convolutional neural network, a dataset representing the concept “striped”
and a random dataset in which the concept “striped” is not present must be
assembled. From the CAVs, the conceptual sensitivity can be calculated, which
is the product of the CAV and the derivative of the classification (of the original 
network) with respect to the specified neural network layer and class. Conceptual 
sensitivity thus indicates how strongly the presence of a concept contributes to 
the desired class. 

While the CAV is a local explanation as it relates to a single classification, 
the TCAV combines the CAVs across the data into a global explanation method 
and thus answers the question of how much a concept contributed overall to 
a given classification. First, the CAVs are calculated for the entire dataset for 
the selected class, concept and level. Then TCAV calculates the ratio of images 
with positive conceptual sensitivity, which indicates for how many images the 
concept contributed to the class. This ratio is calculated multiple times, each 
time using a different “negative” sample where the concept is not present, and a 
two-tailed Student t-test [77] is applied to test whether the conceptual sensitivity 
is significantly different from zero (the test part is where the “T” in TCAV comes 
from). 
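
The pipeline described above can be sketched in a few lines. The sketch assumes that layer activations and the gradients of the class output with respect to those activations are already available as arrays; here they are synthetic. The functions `fit_cav` and `tcav_score` are hypothetical helpers and not the API of the tensorflow/tcav package, and the repeated sampling with the statistical test is omitted.

```python
import numpy as np

def fit_cav(acts_concept, acts_random, lr=0.1, steps=500):
    """Fit a logistic regression separating concept from random activations
    and return its normalised weight vector as the CAV."""
    X = np.vstack([acts_concept, acts_random])
    y = np.concatenate([np.ones(len(acts_concept)), np.zeros(len(acts_random))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):  # plain full-batch gradient descent
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w / np.linalg.norm(w)

def tcav_score(class_grads, cav):
    """Fraction of inputs whose conceptual sensitivity (gradient . CAV) is positive."""
    sensitivities = class_grads @ cav
    return np.mean(sensitivities > 0)

# Synthetic activations (assumption): the concept shifts the first unit.
rng = np.random.default_rng(0)
acts_random = rng.normal(size=(200, 3))
acts_concept = rng.normal(size=(200, 3)) + np.array([3.0, 0.0, 0.0])
cav = fit_cav(acts_concept, acts_random)

# Synthetic gradients of the class output w.r.t. the layer, one row per image.
class_grads = np.array([[1.0, 0.0, 0.0],
                        [2.0, 0.1, 0.0],
                        [-1.0, 0.0, 0.0],
                        [3.0, 0.0, -0.1]])
score = tcav_score(class_grads, cav)
```

With these toy inputs the CAV points mainly along the first activation unit, and three of the four images have a positive conceptual sensitivity.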


GitHub Repo: https://github.com/tensorflow/tcav 


Discussion: TCAV can be applied to detect concept sensitivity for image
classifiers that are gradient-based, such as deep neural networks. TCAV can also
be used to analyze fairness aspects, e.g. whether gender or attributes of
protected groups are used for classification. A clear advantage is that TCAV can be
used by users without machine learning expertise, as the most important part
is collecting the concept images, where domain expertise is important. TCAV
allows a classification model to be tested for arbitrary concepts, even if the model
was not explicitly trained on them. The technique can be used to study whether
a network learned “flawed” concepts, such as spurious correlations. Detecting
flawed concepts can help to improve the model. For example, one could study
how important the presence of snow is for classifying wolves in images, and
if it turns out to be important, adding images of wolves without snow might
improve the robustness of the model. One drawback is the effort required for
labeling and collecting new data. Some concepts might also be too abstract to
test, as the collection of a concept dataset might be difficult. How would one,
for example, collect a dataset of images representing the concept “happiness”?
Furthermore, TCAV may not work well with shallower neural networks, as only
deeper networks learn more abstract concepts. Also, the technique is not
applicable to text and tabular data, but mainly to image data (see footnote 5,
last accessed: 21-Feb-2022). A practical example from the medical domain can be
found in [19].


2.8 XGNN (Explainable Graph Neural Networks) 


Idea: The XGNN method [88] is a post-hoc method that operates on the model
level, meaning that it does not strive to provide individual example-level
explanations. Reinforcement learning (RL) drives a search for an adequate graph,
starting from a randomly chosen node or a relatively small graph, as defined by
prior knowledge. The RL algorithm follows two rewards at the same time: first, it
tries to increase the performance of the GNN, and second, it tries to keep
generating valid graphs, depending on the domain requirements. The action space
contains only the addition of an edge between nodes of the existing graph or an
extension with a new node. In the case where the action has a non-desirable
contribution, a negative reward is provided.


GitHub Repo: https://github.com/divelab/DIG/tree/dig/benchmarks/xgraph/supp/XGNN and pseudocode in the paper.


Discussion: This explanation method was designed specifically for the task of
graph classification. The returned graphs are the ones that are the most
representative for the GNN decision and usually have a particular property that
is ingrained to make the validation possible. It is worth mentioning that this is
the only method that provides model-level explanations for GNN architectures.
The use of RL is justified by the fact that the search for the explanation graph
is non-differentiable, since it is not only driven by the performance but also
by the plausibility and validity of the generated graph. Because the training of
GNNs involves aggregations and combinations, this is an efficient way to
overcome the obstacle of non-differentiability. The provided explanation is considered
to be more effective for big datasets, where humans don’t have the time to check
each example’s explanation individually. A disadvantage is
that the research idea is based on the assumption that the network motifs that are
the result of this explanation method are the ones to which the GNN is most
“responsive”; this is not entirely justified, since one does not know whether
other graph information was also important for the decision of the network. The
results of the explanations are also non-concrete, since in many cases ground
truth is missing. That leads to a rather weak validation that is based on abstract
concepts and properties of the discovered graphs, such as whether they contain
cycles or not.

5 For discussion see: https://openreview.net/forum?id=S1viikbCW.


2.9 SHAP (Shapley Values) 


Note that the concepts described in this section also apply to the methods
presented in Sects. 2.10-2.12.

Methods in this family are concerned with explanations for the model $f$ at
some individual point $x^*$. They are based on a value function $e_S$, where $S$ is a
subset of variable indexes $S \subseteq \{1, \ldots, p\}$. Typically, this function is defined as
the expected value for a conditional distribution in which conditioning applies
to all variables in the subset $S$:

\[
e_S = E[f(x) \mid x_S = x^*_S]. \tag{4}
\]

The expected value is typically used for tabular data. In contrast, for other data
modalities, this function is also often defined as the model prediction at $x^*$ after
zeroing out the values of variables with indices outside $S$. Whichever definition
is used, the value of $e_S$ can be thought of as the model’s response once the
variables in the subset $S$ are specified.

The purpose of attribution is to decompose the difference $f(x^*) - e_\emptyset$ into
parts that can be attributed to individual variables (see Fig. 3A).


Idea: Assessing the importance of variable $i$ is based on analysing how adding
variable $i$ to the set $S$ will affect the value of the function $e_S$. The contribution
of a variable $i$ is denoted by $\phi(i)$ and calculated as a weighted average over all
possible subsets $S$:

\[
\phi(i) = \sum_{S \subseteq \{1,\ldots,p\} \setminus \{i\}} \frac{|S|!\,(p - 1 - |S|)!}{p!} \left( e_{S \cup \{i\}} - e_S \right). \tag{5}
\]

This formula is equivalent to

\[
\phi(i) = \frac{1}{|\Pi|} \sum_{\pi \in \Pi} \left( e_{\mathrm{before}(\pi,i) \cup \{i\}} - e_{\mathrm{before}(\pi,i)} \right), \tag{6}
\]

where $\Pi$ is the set of all orderings of the $p$ variables and $\mathrm{before}(\pi, i)$ stands for the
subset of variables that are before variable $i$ in the ordering $\pi$. Each ordering
corresponds to a sequence of values $e_S$ that shift from $e_\emptyset$ to $f(x^*)$ (see Fig. 3B).

In summary, the analysis of a single ordering shows how adding consecutive
variables changes the value of the $e_S$ function as presented in Fig. 3B. SHAP
[51] arises as an averaging of these contributions over all possible orderings. This
algorithm is an adaptation of Shapley values to explain individual predictions
of machine learning models. Shapley values were initially proposed to distribute
payouts fairly in cooperative games and are the only solution satisfying the axioms
of efficiency, symmetry, dummy, and additivity.
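
The averaging over orderings in Eq. (6) can be made concrete with a minimal brute-force sketch. It assumes the simpler of the two value functions mentioned above, i.e. $e_S$ is the prediction at $x^*$ with all variables outside $S$ replaced by a zero baseline; the toy model `f` and the helper `shapley_values` are hypothetical, and this enumeration is not how the shap library computes its estimates.

```python
import itertools
import numpy as np

def shapley_values(f, x_star, baseline):
    """Exact Shapley values via Eq. (6): average marginal contributions
    over all orderings of the variables."""
    p = len(x_star)
    phi = np.zeros(p)
    orderings = list(itertools.permutations(range(p)))
    for order in orderings:
        x = baseline.copy()
        value = f(x)                 # e_S for the empty set
        for i in order:
            x[i] = x_star[i]         # add variable i to the set S
            new_value = f(x)
            phi[i] += new_value - value
            value = new_value
    return phi / len(orderings)

def f(x):
    """Toy model (assumption) with an interaction between x0 and x2."""
    return x[0] + 2.0 * x[1] + x[0] * x[2]

x_star = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)
phi = shapley_values(f, x_star, baseline)
```

The interaction term is split equally between the two variables involved, and the attributions sum to $f(x^*) - e_\emptyset$. The cost grows with $p!$, so such an enumeration is only feasible for a handful of variables.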


[Figure 3: panels A-E]

Fig. 3. Panel A. The methods presented in Sects. 2.9-2.12 explain the difference in
prediction between a particular observation ($x^*$) and a baseline value. Often
the expected value of the model’s prediction distribution is taken as the baseline.
The methods described here distribute this difference $f(x^*) - e_\emptyset$ among the variables
in the model. Panel B. Attributions are based on the changes in the expected value of
the model prediction due to successive conditioning. For a given sequence of variable
order (here, 1, 2, 3, 4) one can calculate how adding another variable will change the
expected prediction of the model. Panel C. For the SHAP method, the variables have
no structure, so any sequence of variables is treated as equally likely. Panel D. The ASV
method takes into account a causal graph for variables. Only variable orderings that
are consistent with this dependency graph are considered in the calculation of
attributions. The causal graph controls where to assign attributions in the case of dependent
variables. Panel E. The Shapley Flow method also considers a causal graph. It
allocates attributions to the edges in this graph, showing how these attributions propagate
through the graph.


GitHub Repo: https://github.com/slundberg/shap


Discussion: SHAP values, together with the baseline $e_\emptyset$, sum up to the model
prediction, i.e.

\[
f(x^*) = e_\emptyset + \sum_i \phi(i). \tag{7}
\]

In some situations, this is a very desirable property, e.g. if a pricing model
predicts the value of a certain product, it is desirable to decompose this prediction
additively into components attributable to individual variables. SHAP draws 
from a rich theoretical underpinning in game theory and fulfils desirable axioms, 
for example, that features that did not contribute to the prediction get an attri- 
bution of zero. Shapley values can be further combined to global interpretations 
of the model, such as feature dependence plots, feature importance and interac- 
tion analysis. 

One large drawback of Shapley values is their immense computational com- 
plexity. For modern models such as deep neural networks and high dimensional 
inputs, the exact computation of Shapley values is intractable. However, model- 
specific implementations exist for tree-based methods (random forest, xgboost 
etc.) or additive models [49]. Certain estimation versions of SHAP, such as
KernelSHAP, should be used with care because they are slow to compute.
Furthermore, when features are dependent, Shapley values will cause extrapolation
to areas with low data density. Conditional versions exist [50] (for tree-based 
models only), but the interpretation changes which is a common pitfall [57]. 
SHAP explanations are not sparse since to each feature that changes the predic- 
tion, a Shapley value different from zero is attributed, no matter how small the 
influence. If sparse explanations are required, counterfactual explanations might 
be preferable. 


2.10 Asymmetric Shapley Values (ASV) 


Idea: SHAP values are symmetrical. This means that if two variables have the 
same effect on the model’s behaviour, e.g. because they take identical values, they 
will receive equal attributions. However, this is not always a desirable property. 
For example, if we knew that one of the variables has a causal effect on the 
other, then it would make more sense to assign the entire attribution to the 
source variable. 

Asymmetric Shapley values (ASV) [22,23] allow the use of additional knowledge
about the causal relations between variables in the model explanation process.
A cause-effect relationship described in the form of a causal graph allows the
attribution of variables to be redistributed in such a way that the source
variables, which affect both the other dependent variables and the model
predictions, receive a greater attribution (see Fig. 3D). SHAP values are a special case
of ASV values, where the causal graph is reduced to a set of unrelated vertices
(see Fig. 3C).

The ASV values for variable $i$ are also calculated as the average effect of
adding a variable to a coalition of other variables, in the same way as expressed
in Eq. (6). The main difference is that not all possible orders of variables are
considered, but only the orders that are consistent with the causal graph. Thus, a
larger effect will be attributed to the source variables.
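
The effect of restricting the orderings can be seen in a minimal sketch. It assumes a toy dataset in which $x_2$ is an exact copy of its cause $x_0$, a hypothetical model that only reads $x_2$, and the conditional value function of Eq. (4) estimated by filtering rows; none of the names below come from the shapFlex package.

```python
import itertools
import numpy as np

# Toy data (assumption): x2 is a copy of x0 (x0 -> x2), x1 is independent.
X = np.array([[0, 0, 0],
              [0, 1, 0],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)

def f(X):
    """Hypothetical model that only reads the proxy variable x2."""
    return X[:, 2]

def value(S, x_star):
    """e_S = E[f(x) | x_S = x*_S], estimated from the rows of X matching x* on S."""
    mask = np.ones(len(X), dtype=bool)
    for i in S:
        mask &= X[:, i] == x_star[i]
    return f(X[mask]).mean()

def attributions(x_star, allowed):
    """Average marginal contributions over the orderings accepted by `allowed`."""
    p = len(x_star)
    orderings = [o for o in itertools.permutations(range(p)) if allowed(o)]
    phi = np.zeros(p)
    for order in orderings:
        S = []
        for i in order:
            phi[i] += value(S + [i], x_star) - value(S, x_star)
            S.append(i)
    return phi / len(orderings)

x_star = np.array([1.0, 1.0, 1.0])
# SHAP: every ordering is allowed.
shap_phi = attributions(x_star, allowed=lambda o: True)
# ASV: only orderings in which the cause x0 precedes its effect x2.
asv_phi = attributions(x_star, allowed=lambda o: o.index(0) < o.index(2))
```

With all orderings, $x_0$ and $x_2$ share the attribution equally; with orderings restricted to the causal graph, the entire attribution moves to the source variable $x_0$, while the total remains $f(x^*) - e_\emptyset$.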


GitHub Repo: https://github.com/nredell/shapFlex


Discussion: In order to use the ASV, a causal graph for the variables is needed.
Such a graph is usually created based on domain knowledge. Examples include
applications in bioinformatics with signalling pathway data, for which the
underlying causal structure is experimentally verified, or applications in the social
sciences with sociodemographic data, in which the direction of the relationship can be
determined based on expert knowledge (e.g., age affects income rather than
income affects age).

A particular application of the ASV value is model fairness analysis. If
a protected attribute, such as age or sex, does not directly affect the model’s
score, its SHAP attribution will be zero. But if the protected attribute is the cause
of other proxy variables, then the ASV values will capture this indirect effect
on the model.


2.11 Break-Down 


Idea: Variable contribution analysis is based on examining the change in $e_S$
values along with a growing set of variables described by a specific order (see
Fig. 3B). If the model $f$ has interactions, different orderings of the variables
may lead to different contributions. The SHAP values average over all possible
orderings (see Eq. (6)), thus leading to additive contributions and neglecting the
interactions.

An alternative is to analyze different orderings to detect when one variable 
has different contributions depending on what other variables precede it. This is 
a sign of interaction. The Break-Down method (see [16,17]) analyzes the various 
orders to identify and visualize interactions in the model. The final attributions 
are determined based on a single ordering which is chosen based on greedy 
heuristics. 
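
The order dependence that signals an interaction can be shown with a minimal sketch. It assumes a hypothetical model with a pure interaction between $x_0$ and $x_1$ and a zero baseline in place of the expected values $e_S$; the helper `ordered_contributions` is illustrative only and is not the DALEX API, nor does it implement the greedy ordering heuristic.

```python
import numpy as np

def ordered_contributions(f, x_star, baseline, order):
    """Contribution of each variable when variables are fixed in the given order."""
    x = baseline.copy()
    value = f(x)
    contrib = np.zeros(len(x_star))
    for i in order:
        x[i] = x_star[i]            # fix variable i at its observed value
        new_value = f(x)
        contrib[i] = new_value - value
        value = new_value
    return contrib

def f(x):
    """Hypothetical model with a pure interaction between x0 and x1."""
    return x[0] * x[1] + x[2]

x_star = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)
contrib_a = ordered_contributions(f, x_star, baseline, order=(0, 1, 2))
contrib_b = ordered_contributions(f, x_star, baseline, order=(1, 0, 2))
```

The contribution of the additive variable $x_2$ is the same in both orderings, whereas the interaction effect is credited to whichever of $x_0$ and $x_1$ enters last. Such a difference between orderings is the sign of interaction described above.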


GitHub Repo: https://github.com/ModelOriented/DALEX 


Discussion: Techniques such as SHAP generate explanations in the form of 
additive contributions. However, these techniques are often used in the analysis 
of complex models, which are often not additive. [26] shows that for many tabular 
datasets, an additive explanation may be an oversimplification, and it may lead 
to a false belief that the model behaves in an additive way. 


2.12 Shapley Flow 


Idea: Like Asymmetric Shapley Values (ASV), Shapley Flow [84] allows
the use of the dependency structure between variables in the explanation
process. As in ASV, the relationship is described by a causal graph. However, unlike
ASV and other methods, attribution is assigned not to the nodes (variables) but
to the edges (relationships between variables). An edge in a graph is significant if its
removal would change the predictions of the model (see Fig. 3E). The edge
attribution has the additional property that the classical Shapley values hold for
each explanation boundary. The most extreme explanation boundary corresponds
to the ASV method. The Shapley Flow method determines the attributions for
each edge in the causal graph.


GitHub Repo: https://github.com/nathanwang000/Shapley-Flow


Discussion: Shapley Flow attribution analysis carries a lot of information about
both the structure of the relationship between variables and the effect of
particular groups of variables (explanation boundaries) on the predictions. A
disadvantage is that it requires knowledge of the dependency structure
in the form of a directed causal graph, which limits the number of problems to
which it can be applied. For readability reasons, it is limited to small numbers
of variables. It also requires the definition of a background case, i.e. a reference
observation. Explanations may vary depending on the reference observation
chosen.


2.13 Textual Explanations of Visual Models 


Idea: The generation of textual descriptions of images is addressed by several
machine learning models that contain both a part that processes the input
images, typically a convolutional neural network (CNN), and one that learns
an adequate text sequence, usually a recurrent neural network (RNN). These
two parts cooperate to produce image-descriptive sentences, which presupposes
that a classification task is successfully accomplished. One of the first
benchmark datasets that contained image descriptions, Microsoft COCO
(MS-COCO) [48], was introduced as early as 2014. The models that achieve good
classification performance first detect components and concepts of the image
and then construct sentences in which objects, subjects and their characteristics
are connected by verbs. The problem of semantic enrichment of images
for language-related tasks is addressed in a number of ways (see, for example,
the Visual Genome project [45]); however, in most cases, such descriptions are
not directly tied to visual recognition tasks. Nevertheless, an advantage is that
textual descriptions are easier to analyze and validate than attribution maps.

It is important to note that the mere description of the image’s content is not
equivalent to an explanation of the decision-making process of the neural network
model. Unless the produced sentences contain the unique attributes that
help differentiate between the images of each class, the content of the words
should not be considered class-relevant content. A solution to this problem is
proposed in [31]. The main goal of this method is to find those
characteristics that are discriminative, since they were used by the neural network
model to accomplish the task; exactly those need to be present in the
generated text. To achieve this, the training does not just use the relevance
loss, which generates descriptions relevant to the predicted class based on context
borrowed from a fine-grained image recognition model through conditional
probabilities; a discriminative loss is also introduced to generate sentences rich in
class-discriminative features. The introduced weight update procedure consists of two
components, one based on the gradient of the relevance loss and the second based
on the gradient of the discriminative loss, so that descriptions that are both relevant
to the predicted class and contain words with high discriminative capacity are
rewarded. The reinforcement learning method REINFORCE [85] is used for the
backpropagation of the error through sampling during the training process.


GitHub Repo: https://github.com/LisaAnne/ECCV2016 


Discussion: High METEOR [13] and CIDEr [82] scores for relevant explanations
were measured for the generated sentences. It is necessary to compare the
resulting explanations with experts, since only they know the difference between
sentences that correctly describe the visual content and ones that concentrate
on what occurs only in the class to which the images belong. This is positive and
negative at the same time; unfortunately, there is no way to check how much of the
generated explanation is consistent without domain knowledge. Furthermore,
data artefacts can also influence both the performance and explanation quality
negatively. Overall though, ablation studies in which parts of the model were
tested separately showed that the components had a higher performance
when trained jointly than when trained alone. That indicates that the joint training of
visual processing and textual explanation generation is beneficial for each part
individually.


2.14 Integrated Gradients 


Idea: The Integrated Gradients method [80] is based on two fundamental
axioms, sensitivity and implementation invariance. Sensitivity means that, for every
input and baseline that differ in one feature but have different predictions, the
differing feature receives a non-zero attribution. Implementation invariance means
that if two models are functionally equivalent, then their attributions must be
identical. Although these two axioms sound very natural, it turns out that many
attribution methods do not have these properties. In particular, when a model
has flattened predictions around a specific point of interest, the gradient at that point
zeroes out and does not carry information useful for the explanation.

The approach proposed by the Integrated Gradients method for a model $f$
aggregates the gradients $\partial f(\tilde{x}) / \partial \tilde{x}_i$ computed along the path $P$ connecting the point of
interest $x$ to a reference observation, the baseline $x^*$ (for computer vision
this could be a black image and for text an empty sentence).

More formally, for the $i$th feature, Integrated Gradients are defined as

$$\mathrm{IntegratedGrads}_i(x) = (x_i - x_i^*) \int_{\alpha=0}^{1} \frac{\partial f(x^* + \alpha (x - x^*))}{\partial x_i} \, d\alpha.$$

The integral can be replaced by a sum over a set of alpha values in the
interval [0,1].
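
The replacement of the integral by a sum can be illustrated with a minimal sketch. The toy model, its weights `W`, and all function names below are assumptions made for illustration; they are not taken from [80] or from the repository listed below. A midpoint Riemann sum over `steps` values of alpha stands in for the integral.

```python
import math

W = [2.0, -1.0, 0.5]  # hypothetical weights of a toy differentiable model


def model(x):
    """Toy model: sigmoid of a weighted sum of the inputs."""
    z = sum(w * xi for w, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x)
    return [p * (1.0 - p) * w for w in W]


def integrated_gradients(x, baseline, steps=200):
    """Approximate the path integral with a midpoint Riemann sum over alpha."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to x.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = gradient(point)
        for i in range(len(x)):
            avg_grad[i] += grad[i] / steps
    # Scale the averaged gradient by the feature-wise distance to the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attr = integrated_gradients(x, baseline)
print(attr, sum(attr), model(x) - model(baseline))
```

For this toy model the attributions sum, up to discretisation error, to the difference between the prediction at the point of interest and the prediction at the baseline, which is a convenient sanity check when choosing the number of steps.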


GitHub Repo: https://github.com/ankurtaly/Integrated-Gradients


Discussion: Integrated Gradients is a widespread technique for explaining deep
neural networks or other differentiable models. It is a theoretically sound
approach based on two desirable properties: sensitivity and implementation
invariance. In addition, it is computationally efficient and uses gradient
information at a few selected points $\alpha$. The three main drawbacks are: (1) it
needs a baseline observation, the selection of which significantly influences the
attributions, (2) it works only for differentiable models, which makes it suitable
for neural networks but not, e.g., for decision trees, (3) by default, gradients
are integrated along the shortest path between the baseline and the point of
interest. Depending on the topology of the data, this path does not always make
sense or cover the data. Furthermore, deep models usually suffer from the
gradient shattering problem [12], which may negatively affect the explanation
(see discussion in [70]). Extensions to this method have been proposed to
overcome the above drawbacks.


2.15 Causal Models 


Description: In the work of Madumal et al. [53] a structural causal model 
[29] is learned, which can be considered an extension of Bayesian Models [44,71] 
of the RL environment with the use of counterfactuals. It takes into account 
events that would happen or environment states that would be reached under 
different actions taken by the RL agent. Ultimately, the goal of any RL agent 
is to maximize a long-term reward; the explanation provides causal chains until 
the reward-receiving state is reached. The researchers take care to keep
the explanations minimally complete by removing some of the intermediate
nodes in the causal chains, to conform to the explanation satisfaction conditions 
according to the Likert scale [33]. The counterfactual explanation is computed by 
comparing causal chain paths of actions not chosen by the agent (according to the 
trained policy). To keep the explanation as simple as possible, only the differences 
between the causal chains comprise the returned counterfactual explanation. 


GitHub Repo: No Github Repo 


Discussion: Model-free reinforcement learning with a relatively small state
and action space has the advantage that we can explain how the RL agent
makes its decisions in a causal way; since neural networks base their decisions on
correlations, this is one of the first works towards causal explanations. The user
can get answers to the questions “why” and “why not” an action
was chosen by the agent. The provided explanations are appropriate with respect
to satisfiability and ethics requirements, and are personalized to the human mental
model by the use of a dedicated user interface. On the negative side,
this explanation method is evaluated only on problems with very small state
and action spaces (9 and 4, respectively). Current RL problems have much
larger state and action spaces, and the solution can be found with the use of
Deep Reinforcement Learning [27,32]. The reason is that the structural model
of the environment dynamics is not known a priori and must be discovered and
approximated during exploration. Furthermore, this work applies only to the
finite domain, although the authors note that it will be part of their research
work to extend it to continuous spaces.


2.16 Meaningful Perturbations 


Idea: This approach was proposed by Fong and Vedaldi [21] and can be regarded
as a model-agnostic, perturbation-based explanation method. Thus, the explanation
is computed solely based on the reaction of the model to a perturbed (or
occluded) input sample. For a given sample x, the method aims to synthesize a
sparse occlusion map (i.e., the explanation) that leads to the maximum drop of
the model’s prediction f(x), relative to the original prediction with the
unperturbed x. Thus, compared to simple occlusion-based techniques, which naively
perturb a given sample by sequentially occluding parts of it, the Meaningful
Perturbation algorithm aims to directly learn the explanation by formulating the
explanation problem as a meta-prediction task and using tools from optimization
to solve it. Sparsity constraints ensure that the search focuses on finding the
smallest possible perturbation mask that has the largest effect on the certainty
of the classification.
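
The idea of learning a sparse mask by optimization can be sketched on a toy problem. Everything in the block below is an assumption made for illustration: the logistic model with weights `W`, the replacement of masked features by a fixed baseline value, the plain L1-style sparsity penalty `lam`, and projected gradient descent as the optimizer. It is not the implementation of [21], which works on images.

```python
import math

W = [3.0, 0.1, -2.0]  # hypothetical weights of a toy classifier


def f(x):
    """Toy model: predicted probability of the class of interest."""
    z = sum(w * xi for w, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))


def learn_mask(x, baseline, lam=0.1, lr=0.5, steps=200):
    """Minimise f(perturbed x) + lam * (number of masked features, relaxed).

    mask[i] = 1 keeps feature i, mask[i] = 0 replaces it by the baseline.
    """
    mask = [1.0] * len(x)
    for _ in range(steps):
        x_pert = [m * xi + (1.0 - m) * b for m, xi, b in zip(mask, x, baseline)]
        p = f(x_pert)
        for i in range(len(x)):
            # Derivative of f(x_pert) + lam * sum_j (1 - mask[j]) w.r.t. mask[i].
            grad = p * (1.0 - p) * W[i] * (x[i] - baseline[i]) - lam
            # Gradient step, then projection back onto [0, 1].
            mask[i] = min(1.0, max(0.0, mask[i] - lr * grad))
    return mask


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
mask = learn_mask(x, baseline)
x_masked = [m * xi + (1.0 - m) * b for m, xi, b in zip(mask, x, baseline)]
print(mask, f(x), f(x_masked))
```

In this toy setting only the first feature is removed: deleting it lowers the prediction by more than the sparsity penalty costs, whereas the second feature contributes too little and the third one acts against the prediction.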


GitHub Repo: https://github.com/ruthcfong/perturb_explanations 


Discussion: Like other model-agnostic approaches, Meaningful Perturbations is
a very flexible method, which can be directly applied to any machine learning
model. The approach can also be interpreted from a rate-distortion
perspective [43]. Since the Meaningful Perturbations method involves optimization, it
is computationally much more demanding than propagation-based techniques
such as LRP. It is also well known that the perturbation process (occlusion or
deletion can be seen as a particular type of perturbation) moves the sample
out of the manifold of natural images and thus can introduce artifacts. The
use of generative models has been suggested to overcome this out-of-manifold
problem [1].


2.17 EXplainable Neural-Symbolic Learning (X-NeSyL) 


Idea: Symbolic AI is an emerging field that has been shown to contribute
immensely to Explainable AI. Neuro-Symbolic methods [24] incorporate prior
human knowledge for various tasks such as concept learning and at the same time
they produce output that is more interpretable, such as mathematical equations
or Domain-Specific Languages (DSL) [55].

A research work that is dedicated to using symbolic knowledge of the domain 
experts, expressed as a knowledge graph (KG), to align it with the explanations 
of a neural network is the EXplainable Neural-Symbolic Learning (X-NeSyL) 
[20]. The researchers start with the goal to encourage the neural network that 
performs classification to assign feature importances to the object’s parts in a 
way that corresponds to the compositional way humans classify. After using 
state-of-the-art CNN architectures and applying methods such as SHAP (see 
Sect. 2.9) to quantify the positive and negative influence of each detected feature, 
a graph is built that encompasses constraints and relations elicited from the 
computed importances. This graph is compared to the KG provided by human 
experts. A designated loss that punishes non-overlap between these two has been 
shown to boost explainability and in some cases performance. 


GitHub Repo: https://github.com/JulesSanchez/X-NeSyL,
https://github.com/JulesSanchez/MonuMAI-AutomaticStyleClassification


Discussion: This method can be seen as an explainability-by-design approach.
That means that at each step of the training process, it is ensured that the end
result will be interpretable. This is not an ad-hoc method; the training contains
a loss to guide the neural network towards explanations that have a
human-expert-like structure. Furthermore, SHAP values provide intermediate feature
relevance results that are straightforward to understand. A disadvantage is that
the same thing that fosters explainability also contributes to the drawbacks of this
approach, namely that it needs domain-specific knowledge. This knowledge is not
always easy to gather, it may be contradictory if several experts are involved, and
it constrains the network to compute in a specific way. The researchers
comment on that particular issue and exercise their method with many datasets,
test several CNN architectures and provide performance results with established
as well as newly invented methods to see where and how the human-in-the-loop
[37] works in practice.


3 Conclusion and Future Outlook 


In the future, we expect that the newly invented xAI methods will capture causal 
dependencies. Therefore, it will be important to measure the quality of explana- 
tions so that an xAI method achieves a certain level of causal understanding [65] 
for a user with effectiveness, efficiency and satisfaction in a given context of use 
[34]. Successful xAI models in the future will also require new human-AI inter- 
faces [36] that enable contextual understanding and allow a domain expert to ask 
questions and counterfactuals [35] (“what-if” questions). This is where a human- 
in-the-loop can (sometimes - not always, of course) bring human experience and 
conceptual knowledge to AI processes [37]. Such conceptual understanding is
something that the best AI algorithms in the world (still) lack, and this is where
the international xAI community will make many valuable contributions in the
future.


Abstract. An increasing number of model-agnostic interpretation techniques
for machine learning (ML) models such as partial dependence
plots (PDP), permutation feature importance (PFI) and Shapley val- 
ues provide insightful model interpretations, but can lead to wrong con- 
clusions if applied incorrectly. We highlight many general pitfalls of 
ML model interpretation, such as using interpretation techniques in the 
wrong context, interpreting models that do not generalize well, ignoring 
feature dependencies, interactions, uncertainty estimates and issues in 
high-dimensional settings, or making unjustified causal interpretations, 
and illustrate them with examples. We focus on pitfalls for global meth- 
ods that describe the average model behavior, but many pitfalls also 
apply to local methods that explain individual predictions. Our paper 
addresses ML practitioners by raising awareness of pitfalls and identi- 
fying solutions for correct model interpretation, but also addresses ML 
researchers by discussing open issues for further research.


1 Introduction


In recent years, both industry and academia have increasingly shifted away 
from parametric models, such as generalized linear models, and towards non- 
parametric and non-linear machine learning (ML) models such as random forests, 
gradient boosting, or neural networks. The major driving force behind this devel- 
opment has been a considerable outperformance of ML over traditional models 
on many prediction tasks [32]. In part, this is because most ML models han- 
dle interactions and non-linear effects automatically. While classical statistical 
models — such as generalized additive models (GAMs) — also support the inclu- 
sion of interactions and non-linear effects, they come with the increased cost of 
having to (manually) specify and evaluate these modeling options. The benefits 
of many ML models are partly offset by their lack of interpretability, which is 
of major importance in many applications. For certain model classes (e.g. lin- 
ear models), feature effects or importance scores can be directly inferred from 
the learned parameters and the model structure. In contrast, it is more diffi- 
cult to extract such information from complex non-linear ML models that, for 
instance, do not have intelligible parameters and are hence often considered 
black boxes. However, model-agnostic interpretation methods allow us to har- 
ness the predictive power of ML models while gaining insights into the black-box 
model. These interpretation methods are already applied in many different fields. 
Applications of interpretable machine learning (IML) include understanding pre- 
evacuation decision-making [124] with partial dependence plots [36], inferring 
behavior from smartphone usage [105,106] with the help of permutation feature 
importance [107] and accumulated local effect plots [3], or understanding the 
relation between critical illness and health records [70] using Shapley additive 
explanations (SHAP) [78]. Given the widespread application of interpretable
machine learning, it is crucial to highlight potential pitfalls that, in the worst
case, can produce incorrect conclusions.

This paper focuses on pitfalls for model-agnostic IML methods, i.e. meth- 
ods that can be applied to any predictive model. Model-specific methods, in 
contrast, are tied to a certain model class (e.g. saliency maps [57] for gradient- 
based models, such as neural networks), and are mainly considered out-of-scope 
for this work. We focus on pitfalls for global interpretation methods, which 
describe the expected behavior of the entire model with respect to the whole 
data distribution. However, many of the pitfalls also apply to local explanation 
methods, which explain individual predictions or classifications. Global meth- 
ods include the partial dependence plot (PDP) [36], partial importance (PI) 
[19], accumulated local effects (ALE) [3], or the permutation feature importance
(PFI) [12,19,33]. Local methods include the individual conditional expectation
(ICE) curves [38], individual conditional importance (ICI) [19], local
interpretable model-agnostic explanations (LIME) [94], Shapley values [108] and 
SHapley Additive exPlanations (SHAP) [77,78] or counterfactual explanations 
[26,115]. Furthermore, we distinguish between feature effect and feature importance
methods. A feature effect indicates the direction and magnitude of a change
in predicted outcome due to changes in feature values. Effect methods include
Shapley values, SHAP, LIME, ICE, PDP, or ALE. Feature importance methods
quantify the contribution of a feature to the model performance (e.g. via a
loss function) or to the variance of the prediction function. Importance methods
include the PFI, ICI, PI, or SAGE. See Fig. 1 for a visual summary.


Fig. 1. Selection of popular model-agnostic interpretation techniques, classified as local
or global, and as effect or importance methods.

The interpretation of ML models can have subtle pitfalls. Since many of 
the interpretation methods work by similar principles of manipulating data and 
“probing” the model [100], they also share many pitfalls. The sources of these 
pitfalls can be broadly divided into three categories: (1) application of an unsuit- 
able ML model which does not reflect the underlying data generating process 
very well, (2) inherent limitations of the applied IML method, and (3) wrong 
application of an IML method. Typical pitfalls for (1) are bad model generaliza- 
tion or the unnecessary use of complex ML models. Applying an IML method in 
a wrong way (3) often results from the users’ lack of knowledge of the inherent 
limitations of the chosen IML method (2). For example, if feature dependencies 
and interactions are present, potential extrapolations might lead to mislead- 
ing interpretations for perturbation-based IML methods (inherent limitation). 
In such cases, methods like PFI might be a wrong choice to quantify feature 
importance. 


Table 1. Categorization of the pitfalls by source. 


Sources of pitfall               | Sections
Unsuitable ML model              | 3, 4
Limitation of IML method         | 5.1, 6.1, 6.2, 9.1, 9.2
Wrong application of IML method  | 2, 5.2, 5.3, 7, 8, 9.3, 10


Contributions: We uncover and review general pitfalls of model-agnostic inter- 
pretation techniques. The categorization of these pitfalls into different sources 
is provided in Table 1. Each section describes and illustrates a pitfall, reviews 
possible solutions for practitioners to circumvent the pitfall, and discusses open 
issues that require further research. The pitfalls are accompanied by illustrative
examples for which the code can be found in this repository:
https://github.com/compstat-lmu/code_pitfalls_iml.git. In addition to reproducing our
examples, we invite readers to use this code as a starting point for their own
experiments and explorations.


Related Work: Rudin et al. [96] present principles for interpretability and dis- 
cuss challenges for model interpretation with a focus on inherently interpretable 
models. Das et al. [27] survey methods for explainable AI and discuss challenges 
with a focus on saliency maps for neural networks. A general warning about using 
and explaining ML models for high stakes decisions has been brought forward 
by Rudin [95], in which the author argues against model-agnostic techniques 
in favor of inherently interpretable models. Krishnan [64] criticizes the general 
conceptual foundation of interpretability, but does not dispute the usefulness of 
available methods. Likewise, Lipton [73] criticizes interpretable ML for its lack 
of causal conclusions, trust, and insights, but the author does not discuss any 
pitfalls in detail. Specific pitfalls due to dependent features are discussed by 
Hooker [54] for PDPs and functional ANOVA as well as by Hooker and Mentch 
[55] for feature importance computations. Hall [47] discusses recommendations 
for the application of particular interpretation methods but does not address 
general pitfalls. 


2 Assuming One-Fits-All Interpretability 


Pitfall: Assuming that a single IML method fits in all interpretation contexts 
can lead to dangerous misinterpretation. IML methods condense the complex- 
ity of ML models into human-intelligible descriptions that only provide insight 
into specific aspects of the model and data. The vast number of interpretation
methods makes it difficult for practitioners to choose an interpretation method
that can answer their question. Due to the wide range of goals that are pursued
under the umbrella term “interpretability”, the methods differ in which aspects 
of the model and data they describe. 

For example, there are several ways to quantify or rank the features according 
to their relevance. The relevance measured by PFI can be very different from 
the relevance measured by the SHAP importance. If a practitioner aims to gain 
insight into the relevance of a feature regarding the model’s generalization error, 
a loss-based method (on unseen test data) such as PFI should be used. If we aim 
to expose which features the model relies on for its prediction or classification — 
irrespective of whether they aid the model’s generalization performance — PFI 
on test data is misleading. In such scenarios, one should quantify the relevance 
of a feature regarding the model’s prediction (and not the model’s generalization 
error) using methods like the SHAP importance [76]. 

We illustrate the difference in Fig. 2. We simulated a data-generating process 
where the target is completely independent of all features. Hence, the features 
are just noise and should not contribute to the model’s generalization error. 
Consequently, the features are not considered relevant by PFI on test data.
However, the model mechanistically relies on a number of spuriously correlated
features. This reliance is exposed by marginal global SHAP importance.

As the example demonstrates, it would be misleading to view the PFI com- 
puted on test data or global SHAP as one-fits-all feature importance techniques. 
Like any IML method, they can only provide insight into certain aspects of model 
and data. 
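
The contrast between loss-based and prediction-based relevance can be reproduced in miniature. The sketch below is not the experiment behind Fig. 2: as stated assumptions, a 1-nearest-neighbour regressor replaces the xgboost model as the overfitting learner, and `prediction_shift` (the mean absolute change in prediction after permuting a feature) is a hypothetical, much simpler stand-in for a prediction-based measure such as SHAP importance.

```python
import random

random.seed(0)
P = 3  # number of pure-noise features


def make_data(n):
    """Features and target are sampled independently of each other."""
    X = [[random.random() for _ in range(P)] for _ in range(n)]
    y = [random.random() for _ in range(n)]
    return X, y


X_train, y_train = make_data(100)
X_test, y_test = make_data(500)


def predict(row):
    """1-nearest-neighbour regressor: memorises the training data (overfits)."""
    nearest = min(range(len(X_train)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], row)))
    return y_train[nearest]


def mse(X, y):
    return sum((predict(r) - t) ** 2 for r, t in zip(X, y)) / len(y)


def permute_column(X, j):
    """Return a copy of X in which column j has been shuffled."""
    col = [r[j] for r in X]
    random.shuffle(col)
    return [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]


def pfi(X, y, j):
    """Permutation feature importance: increase in loss after permuting j."""
    return mse(permute_column(X, j), y) - mse(X, y)


def prediction_shift(X, j):
    """Mean absolute change of the prediction after permuting feature j."""
    X_perm = permute_column(X, j)
    return sum(abs(predict(a) - predict(b)) for a, b in zip(X, X_perm)) / len(X)


pfi_test = [pfi(X_test, y_test, j) for j in range(P)]
shift = [prediction_shift(X_test, j) for j in range(P)]
print("PFI on test data:", pfi_test)
print("prediction shift:", shift)
```

Under these assumptions the PFI values on test data scatter around zero, because no feature helps generalization, while the prediction shift is clearly positive for every feature, because the overfitted model does use them.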

Many pitfalls in this paper arise from situations where an IML method that 
was designed for one purpose is applied in an unsuitable context. For example, 
extrapolation (Sect. 5.1) can be problematic when we aim to study how the
model behaves under realistic data but simultaneously can be the correct choice 
if we want to study the sensitivity to a feature outside the data distribution. 

For some IML techniques, especially local methods, even the same method
can provide very different explanations, depending on the choice of
hyperparameters: for counterfactuals, explanation goals are encoded in their
optimization metrics [26,34] such as sparsity and data faithfulness; the scope and
meaning of LIME explanations depend on the kernel width and the notion of complexity
[8,37].


Solution: The suitability of an IML method cannot be evaluated with respect to 
one-fits-all interpretability but must be motivated and assessed with respect to 
well-defined interpretation goals. Similarly, practitioners must tailor the choice 
of the IML method and its respective hyperparameters to the interpretation 
context. This implies that these goals need to be clearly stated in a detailed 
manner before any analysis — which is still often not the case. 


Open Issues: Since IML methods themselves are subject to interpretation, 
practitioners must be informed about which conclusions can or cannot be drawn 
given different choices of IML technique. In general, there are three aspects to 
be considered: (a) an intuitively understandable and plausible algorithmic con- 
struction of the IML method to achieve an explanation; (b) a clear mathematical 
axiomatization of interpretation goals and properties, which are linked by proofs 
and theoretical considerations to IML methods, and properties of models and 
data characteristics; (c) a practical translation for practitioners of the axioms 
from (b) in terms of what an IML method provides and what not, ideally with 
implementable guidelines and diagnostic checks for violated assumptions to guar- 
antee correct interpretations. While (a) is nearly always given for any published 
method, much work remains for (b) and (c). 


3 Bad Model Generalization 


Pitfall: Under- or overfitting models can result in misleading interpretations
with respect to the true feature effects and importance scores, as the model does
not match the underlying data-generating process well [39]. Formally, most IML
methods are designed to interpret the model instead of drawing inferences about
the data-generating process. In practice, however, the latter is often the goal of
the analysis, and then an interpretation can only be as good as its underlying
model. If a model approximates the data-generating process well enough, its
interpretation should reveal insights into the underlying process.


Fig. 2. Assuming one-fits-all interpretability. A default xgboost regression model
that minimizes the mean squared error (MSE) was fitted on 20 independently and
uniformly distributed features to predict another independent, uniformly sampled target.
In this setting, predicting the (unconditional) mean E[Y] in a constant model is
optimal. The learner overfits due to a small training data size. Mean marginal SHAP (red,
error bars indicate 0.05 and 0.95 quantiles) exposes all mechanistically used features.
In contrast, PFI on test data (blue, error bars indicate 0.05 and 0.95 quantiles)
considers all features to be irrelevant, since no feature contributes to the
generalization performance.


Solution: In-sample evaluation (i.e. on training data) should not be used to 
assess the performance of ML models due to the risk of overfitting on the train- 
ing data, which will lead to overly optimistic performance estimates. We must 
resort to out-of-sample validation based on resampling procedures such as hold- 
out for larger datasets or cross-validation, or even repeated cross-validation for 
small sample size scenarios. These resampling procedures are readily available 
in software [67,89], and well-studied in theory as well as practice [4,11,104], 
although rigorous analysis of cross-validation is still considered an open
problem [103]. Nested resampling is necessary when computational model selection
and hyperparameter tuning are involved [10]. This is important, as the Bayes
error for most practical situations is unknown, and we cannot make absolute 
statements about whether a model already optimally fits the data. 
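
The gap between in-sample and out-of-sample estimates can be made concrete with a small sketch. The simulated data, the 1-nearest-neighbour learner, and the helper names are assumptions chosen for illustration; in practice the resampling implementations of the software cited above would be used instead of a hand-written loop.

```python
import random

random.seed(1)
n = 200
X = [random.uniform(-2, 2) for _ in range(n)]
y = [x ** 2 + random.gauss(0, 1) for x in X]  # non-linear signal plus noise


def fit_1nn(X_tr, y_tr):
    """Return a 1-nearest-neighbour predictor (prone to overfitting)."""
    def predict(x):
        i = min(range(len(X_tr)), key=lambda k: abs(X_tr[k] - x))
        return y_tr[i]
    return predict


def mse(predict, X_ev, y_ev):
    return sum((predict(x) - t) ** 2 for x, t in zip(X_ev, y_ev)) / len(y_ev)


def k_fold_mse(X, y, fit, k=5):
    """Average test MSE over k disjoint folds."""
    idx = list(range(len(X)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(mse(model, [X[i] for i in fold], [y[i] for i in fold]))
    return sum(scores) / k


in_sample = mse(fit_1nn(X, y), X, y)
out_of_sample = k_fold_mse(X, y, fit_1nn)
print("in-sample MSE:", in_sample)
print("5-fold CV MSE:", out_of_sample)
```

The in-sample error is zero because every training point is its own nearest neighbour, whereas the cross-validated error reflects the noise the learner has memorised.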

Figure 3 shows the mean squared errors for a simulated example on both 
training and test data for a support vector machine (SVM), a random forest, 
and a linear model. Additionally, PDPs for all models are displayed, which show 
to what extent each model’s effect estimates deviate from the ground truth. The 
linear model is unable to represent the non-linear relationship, which is reflected 
in a high error on both test and training data and the linear PDPs. In contrast, 
the random forest has a low training error but a much higher test error, which 
indicates overfitting. Also, the PDPs for the random forest display overfitting
behavior, as the curves are quite noisy, especially at the lower and upper value
ranges of each feature. The SVM with both low training and test error comes
closest to the true PDPs.


[Figure 3. Top panel: mean squared error (x-axis, 0 to 250) on training and test data for the SVM, the random forest, and the linear model. Bottom panels: PDPs over the feature value (x-axis, -2 to 2) of X1, X2, and X3 for linear regression, random forest, SVM, and the true DGP.]

Fig. 3. Bad model generalization. Top: Performance estimates on training and test
data for a linear regression model (underfitting), a random forest (overfitting) and a
support vector machine with radial basis kernel (good fit). The three features are drawn
from a uniform distribution, and the target was generated as $Y = X_1^2 + X_2 - 5 X_1 X_2 + \epsilon$,
with $\epsilon \sim N(0, 5)$. Bottom: PDPs for the data-generating process (DGP), which is the
ground truth, and for the three models.
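
For readers unfamiliar with how a PDP is obtained, the following minimal sketch computes one by brute force: the feature of interest is set to each grid value for all observations, and the predictions are averaged. The function name, the toy additive model, and the tiny dataset are assumptions for illustration and are unrelated to the models in Fig. 3.

```python
def partial_dependence(predict, X, feature, grid):
    """Average prediction when `feature` is fixed to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)       # copy so the data stay untouched
            modified[feature] = value  # overwrite the feature of interest
            total += predict(modified)
        curve.append(total / len(X))
    return curve


def toy_model(row):
    """Hypothetical additive model: quadratic in x1, linear in x2."""
    return row[0] ** 2 + row[1]


X = [[-1.0, 0.0], [0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
grid = [-2.0, 0.0, 2.0]
pdp = partial_dependence(toy_model, X, feature=0, grid=grid)
print(pdp)  # [5.5, 1.5, 5.5]: grid value squared plus the mean of x2 (1.5)
```

Because the toy model is additive, the curve equals the quadratic effect of the first feature shifted by the average contribution of the second.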


4 Unnecessary Use of Complex Models 


Pitfall: A common mistake is to use an opaque, complex ML model when an 
interpretable model would have been sufficient, i.e. when the performance of 
interpretable models is only negligibly worse — or maybe the same or even better 
— than that of the ML model. Although model-agnostic methods can shed light 
on the behavior of complex ML models, inherently interpretable models still 
offer a higher degree of transparency [95] and considering them increases the 
chance of discovering the true data-generating function [23]. What constitutes 
an interpretable model is highly dependent on the situation and target audience, 
as even a linear model might be difficult to interpret when many features and 
interactions are involved. 

It is commonly believed that complex ML models always outperform more
interpretable models in terms of accuracy and should thus be preferred. However,
there are several examples where interpretable models have proven to be serious
competitors: more than 15 years ago, Hand [49] demonstrated that simple models
often achieve more than 90% of the predictive power of potentially highly complex
models across the UCI benchmark data repository and concluded that such
models often should be preferred due to their inherent interpretability;
Makridakis et al. [79] systematically compared various ML models (including
long-short-term-memory models and multi-layer neural networks) to statistical
models (e.g. damped exponential smoothing and the Theta method) in time series
forecasting tasks and found that the latter consistently show greater predictive
accuracy; Kuhle et al. [65] found that random forests, gradient boosting and
neural networks did not outperform logistic regression in predicting fetal growth
abnormalities; similarly, Wu et al. [120] have shown that a logistic regression
model performs as well as AdaBoost and even better than an SVM in predicting
heart disease from electronic health record data; Baesens et al. [7] showed that
simple interpretable classifiers perform competitively for credit scoring, and in
an update to the study the authors note that “the complexity and/or recency
of a classifier are misleading indicators of its prediction performance” [71].


Solution: We recommend starting with simple, interpretable models such as 
linear regression models and decision trees. Generalized additive models (GAM) 
[50] can serve as a gradual transition between simple linear models and more 
complex machine learning models. GAMs have the desirable property that they 
can additively model smooth, non-linear effects and provide PDPs out-of-the-box,
but without the potential pitfall of masking interactions (see Sect. 6). The
additive model structure of a GAM is specified before fitting the model so that
only the pre-specified feature or interaction effects are estimated. Interactions
between features can be added manually or algorithmically (e.g. via a forward
greedy search) [18]. GAMs can be fitted with component-wise boosting [99]. The
boosting approach allows model complexity to be increased smoothly, from sparse
linear models to more complex GAMs with non-linear effects and interactions.
This smooth transition provides insight into the tradeoffs between model
simplicity and performance gains. Furthermore, component-wise boosting has an
in-built feature selection mechanism as the model is built incrementally, which
is especially useful in high-dimensional settings (see Sect. 9.1). The predictive
performance of models of different complexity should be carefully measured and 
compared. Complex models should only be favored if the additional performance 
gain is both significant and relevant — a judgment call that the practitioner must 
ultimately make. Starting with simple models is considered best practice in data 
science, independent of the question of interpretability [23]. The comparison of 
predictive performance between model classes of different complexity can add 
further insights for interpretation. 
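
The in-built feature selection of component-wise boosting can be illustrated with a deliberately reduced sketch. As stated assumptions, the base learners are univariate linear fits without intercept (instead of the smooth base learners a GAM would use), the loss is the squared error, and the data are simulated so that only the first feature matters; the function name and the step size `nu` are hypothetical.

```python
import random

random.seed(2)
n = 50
X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(n)]
y = [2.0 * row[0] for row in X]  # only the first feature is relevant


def componentwise_boosting(X, y, steps=100, nu=0.1):
    """L2 boosting: in each step, update only the best-fitting single feature."""
    p = len(X[0])
    coef = [0.0] * p
    residual = list(y)
    for _ in range(steps):
        best_j, best_beta, best_gain = 0, 0.0, -1.0
        for j in range(p):
            xj = [row[j] for row in X]
            sxx = sum(v * v for v in xj)
            sxr = sum(v * r for v, r in zip(xj, residual))
            beta = sxr / sxx        # least-squares fit of the residual on feature j
            gain = sxr * sxr / sxx  # reduction in the residual sum of squares
            if gain > best_gain:
                best_j, best_beta, best_gain = j, beta, gain
        # Take a small step towards the best base learner only.
        coef[best_j] += nu * best_beta
        residual = [r - nu * best_beta * row[best_j] for r, row in zip(residual, X)]
    return coef


coef = componentwise_boosting(X, y)
print(coef)
```

The coefficient of the relevant feature approaches 2 as the number of steps grows, while the irrelevant feature is never selected and keeps a coefficient of exactly zero. Stopping earlier yields a sparser, more strongly shrunken model.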


Open Issues: Measures of model complexity make it possible to quantify the trade-off
between complexity and performance and to automatically optimize for multiple
objectives beyond performance. Some steps have been made towards quantifying
model complexity, such as using functional decomposition and quantifying the
complexity of the components [82] or measuring the stability of predictions [92].
However, further research is required, as there is no single perfect definition of
interpretability, but rather multiple depending on the context [30,95].


5 Ignoring Feature Dependence


5.1 Interpretation with Extrapolation 


Pitfall: When features are dependent, perturbation-based IML methods such 
as PFI, PDP, LIME, and Shapley values extrapolate in areas where the model 
was trained with little or no training data, which can cause misleading interpre- 
tations [55]. This is especially true if the ML model relies on feature interactions 
[45] — which is often the case. Perturbations produce artificial data points that 
are used for model predictions, which in turn are aggregated to produce global 
or local interpretations [100]. Feature values can be perturbed by replacing orig- 
inal values with values from an equidistant grid of that feature, with permuted 
or randomly subsampled values [19], or with quantiles. We highlight two major 
issues: First, if features are dependent, all three perturbation approaches pro- 
duce unrealistic data points, i.e. the new data points are located outside of the 
multivariate joint distribution of the data (see Fig. 4). Second, even if features 
are independent, using an equidistant grid can produce unrealistic values for the 
feature of interest. Consider a feature that follows a skewed distribution with 
outliers. An equidistant grid would generate many values between outliers and 
non-outliers. In contrast to the grid-based approach, the other two approaches 
maintain the marginal distribution of the feature of interest. 

Both issues can result in misleading interpretations (illustrative examples are 
given in [55,84]), since the model is evaluated in areas of the feature space with 
few or no observed real data points, where model uncertainty can be expected 
to be very high. This issue is aggravated if interpretation methods integrate 
over such points with the same weight and confidence as for much more realistic 
samples with high model confidence. 


Solution: Before applying interpretation methods, practitioners should check 
for dependencies between features in the data, e.g. via descriptive statistics or 
measures of dependence (see Sect. 5.2). When it is unavoidable to include depen- 
dent features in the model (which is usually the case in ML scenarios), additional 
information regarding the strength and shape of the dependence structure should 
be provided. Sometimes, alternative interpretation methods can be used as a 
workaround or to provide additional information. Accumulated local effect plots 
(ALE) [3] can be applied when features are dependent, but can produce non- 
intuitive effect plots for simple linear models with interactions [45]. For other 
methods such as the PFI, conditional variants exist [17,84,107]. In the case 
of LIME, it was suggested to focus in sampling on realistic (i.e. close to the 
data manifold) [97] and relevant areas (e.g. close to the decision boundary) [69]. 
Note, however, that conditional interpretations are often different and should 
not be used as a substitute for unconditional interpretations (see Sect. 5.3). Fur- 
thermore, dependent features should not be interpreted separately but rather 
jointly. This can be achieved by visualizing e.g. a 2-dimensional ALE plot of 
two dependent features, which, admittedly, only works for very low-dimensional 
combinations. Especially in high-dimensional settings where dependent features 
can be grouped in a meaningful way, grouped interpretation methods might be 
more reasonable (see Sect. 9.1). 

Fig. 4. Interpretation with extrapolation. Illustration of artificial data points 
generated by three different perturbation approaches. The black dots refer to 
observed data points and the red crosses to the artificial data points. 

We recommend using quantiles or randomly subsampled values over equidis- 
tant grids. By default, many implementations of interpretability methods use an 
equidistant grid to perturb feature values [41,81,89], although some also allow 
using user-defined values. 
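
The difference between an equidistant and a quantile grid can be sketched on a skewed feature. The following is a minimal illustration, not taken from any of the cited implementations: the log-normal feature, the grid size of 20, and the helper `share_in_sparse_region` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed feature with a long right tail (log-normal).
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

n_grid = 20
# Equidistant grid between the observed minimum and maximum.
equidistant = np.linspace(x.min(), x.max(), n_grid)
# Quantile grid: grid points follow the empirical distribution of x.
quantile = np.quantile(x, np.linspace(0.0, 1.0, n_grid))

def share_in_sparse_region(grid, data, upper_quantile=0.95):
    """Share of grid points lying above the given empirical quantile of the data."""
    threshold = np.quantile(data, upper_quantile)
    return float(np.mean(grid > threshold))

sparse_equidistant = share_in_sparse_region(equidistant, x)
sparse_quantile = share_in_sparse_region(quantile, x)
print(f"equidistant grid: {sparse_equidistant:.2f} of grid points above the 95% quantile")
print(f"quantile grid:    {sparse_quantile:.2f} of grid points above the 95% quantile")
```

In this setting, most equidistant grid points fall into the tail region that contains only 5% of the observations, whereas the quantile grid places its points where the data are.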


Open Issues: A comprehensive comparison of strategies addressing extrapola- 
tion and how they affect an interpretation method is currently missing. This also 
includes studying interpretation methods and their conditional variants when 
they are applied to data with different dependence structures. 


5.2 Confusing Linear Correlation with General Dependence 


Pitfall: Features with a Pearson correlation coefficient (PCC) close to zero can 
still be dependent and cause misleading model interpretations (see Fig. 5). While 
independence between two features implies that the PCC is zero, the converse is 
generally false. The PCC, which is often used to analyze dependence, only tracks 
linear correlations and has other shortcomings such as sensitivity to outliers 
[113]. Any type of dependence between features can have a strong impact on the 
interpretation of the results of IML methods (see Sect. 5.1). Thus, knowledge 
about the (possibly non-linear) dependencies between features is crucial for an 
informed use of IML methods. 


Solution: Low-dimensional data can be visualized to detect dependence (e.g. 
scatter plots) [80]. For high-dimensional data, several other measures of depen- 
dence in addition to PCC can be used. If dependence is monotonic, Spearman’s 
rank correlation coefficient [72] can be a simple, robust alternative to PCC. 
For categorical or mixed features, separate dependence measures have been pro- 
posed, such as Kendall’s rank correlation coefficient for ordinal features, or the 
phi coefficient and Goodman & Kruskal’s lambda for nominal features [59]. 
Fig. 5. Confusing linear correlation with dependence. Highly dependent 
features X1 and X2 that have a correlation close to zero. A test (H0: Features are 
independent) using Pearson correlation is not significant, but for HSIC, the 
H0-hypothesis gets rejected. Data from [80]. 


Studying non-linear dependencies is more difficult since a vast variety of 
possible associations have to be checked. Nevertheless, several non-linear asso- 
ciation measures with sound statistical properties exist. Kernel-based measures, 
such as kernel canonical correlation analysis (KCCA) [6] or the Hilbert-Schmidt 
independence criterion (HSIC) [44], are commonly used. They have a solid the- 
oretical foundation, are computationally feasible, and robust [113]. In addition, 
there are information-theoretical measures, such as (conditional) mutual infor- 
mation [24] or the maximal information coefficient (MIC) [93], that can however 
be difficult to estimate [9,116]. Other important measures are e.g. the distance 
correlation [111], the randomized dependence coefficient (RDC) [74], or the alter- 
nating conditional expectations (ACE) algorithm [14]. In addition to using PCC, 
we recommend using at least one measure that detects non-linear dependencies 
(e.g. HSIC). 
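
As a rough illustration of the gap between PCC and a non-linear measure, the sketch below compares the Pearson correlation with the sample distance correlation [111] on a quadratic relationship. Distance correlation is used here only because it takes a few lines of NumPy; the simulated data and the function name `distance_correlation` are assumptions made for the example.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-d arrays (biased V-statistic version)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - x[None, :])  # pairwise distances within x
    b = np.abs(y[:, None] - y[None, :])  # pairwise distances within y
    # Double-center both distance matrices.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2_xy = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom <= 0:
        return 0.0
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))

rng = np.random.default_rng(1)
x1 = rng.uniform(-1.0, 1.0, size=500)
x2 = x1 ** 2 + rng.normal(scale=0.05, size=500)  # strong but non-linear dependence

pearson = float(np.corrcoef(x1, x2)[0, 1])
dcor = distance_correlation(x1, x2)
print(f"Pearson correlation:  {pearson:.3f}")
print(f"distance correlation: {dcor:.3f}")
```

A formal independence test would additionally require a null distribution, e.g. obtained by permuting one of the two features.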


5.3 Misunderstanding Conditional Interpretation 


Pitfall: Conditional variants of interpretation techniques avoid extrapolation 
but require a different interpretation. Interpretation methods that perturb fea- 
tures independently of others will extrapolate under dependent features but 
provide insight into the model’s mechanism [56,61]. Therefore, these methods 
are said to be true to the model but not true to the data [21]. 

For feature effect methods such as the PDP, the plot can be interpreted as 
the isolated, average effect the feature has on the prediction. For the PFI, the 
importance can be interpreted as the drop in performance when the feature’s 
information is “destroyed” (by perturbing it). Marginal SHAP value functions 
[78] quantify a feature’s contribution to a specific prediction, and marginal SAGE 
value functions [25] quantify a feature’s contribution to the overall prediction 
performance. All the aforementioned methods extrapolate under dependent fea- 
tures (see also Sect. 5.1), but satisfy sensitivity, i.e. are zero if a feature is not 
used by the model [25,56,61,110]. 
Fig. 6. Misunderstanding conditional interpretation. A linear model was 
fitted on the data-generating process modeled using a linear Gaussian structural 
causal model. The entailed directed acyclic graph is depicted on the left. For 
illustrative purposes, the original model coefficients were updated such that not 
only feature X3, but also feature X2 is used by the model. PFI on test data 
considers both X3 and X2 to be relevant. In contrast, conditional feature 
importance variants either only consider X3 to be relevant (CFI) or consider all 
features to be relevant (conditional SAGE value function). 


Conditional variants of these interpretation methods do not replace feature 
values independently of other features, but in such a way that they conform to 
the conditional distribution. This changes the interpretation as the effects of all 
dependent features become entangled. Depending on the method, conditional 
sampling leads to a more or less restrictive notion of relevance. 

For example, for dependent features, the Conditional Feature Importance 
(CFI) [17,84,107,117] answers the question: “How much does the model perfor- 
mance drop if we permute a feature, but given that we know the values of the 
other features?” [63,84,107].¹ Two highly dependent features might be 
individually important (based on the unconditional PFI), but have a very low conditional 
importance score because the information of one feature is contained in the other 
and vice versa. 
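
The contrast between PFI and CFI for two highly dependent features can be sketched as follows. This is an illustration under strong assumptions, not the procedure of [17,84,107]: the data are simulated, the model is an ordinary least squares fit, importance is computed on the training data, and the conditional sampler (`conditional_permutation`, a hypothetical helper) assumes a linear Gaussian dependence structure.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # almost a copy of x1
x3 = rng.normal(size=n)              # independent of x1 and x2
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + x3 + 0.5 * rng.normal(size=n)

def fit_linear(X, y):
    """Ordinary least squares with intercept; returns a prediction function."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef

def mse(predict, X, y):
    return float(np.mean((y - predict(X)) ** 2))

def conditional_permutation(X, j, rng):
    """Resample column j as E[x_j | x_-j] plus permuted residuals (linear Gaussian assumption)."""
    others = np.delete(X, j, axis=1)
    fitted = fit_linear(others, X[:, j])(others)
    residuals = X[:, j] - fitted
    return fitted + rng.permutation(residuals)

model = fit_linear(X, y)
baseline = mse(model, X, y)
pfi, cfi = [], []
for j in range(X.shape[1]):
    X_marginal = X.copy()
    X_marginal[:, j] = rng.permutation(X[:, j])          # breaks all associations of x_j
    pfi.append(mse(model, X_marginal, y) - baseline)
    X_conditional = X.copy()
    X_conditional[:, j] = conditional_permutation(X, j, rng)  # preserves dependence on x_-j
    cfi.append(mse(model, X_conditional, y) - baseline)

for j in range(3):
    print(f"feature {j + 1}: PFI = {pfi[j]:.3f}, CFI = {cfi[j]:.3f}")
```

In this setup, the two dependent features receive a high PFI but a CFI close to zero, while the independent third feature is rated similarly by both measures.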

In contrast, the conditional variant of PDP, called marginal plot or M-plot 
[3], violates sensitivity, i.e. may even show an effect for features that are not used 
by the model. This is because for M-plots, the feature of interest is not sampled 
conditionally on the remaining features, but rather the remaining features are 
sampled conditionally on the feature of interest. As a consequence, the distri- 
bution of dependent covariates varies with the value of the feature of interest. 
Similarly, conditional SAGE and conditional SHAP value functions sample the 
remaining features conditional on the feature of interest and therefore violate 
sensitivity [25,56,61,109]. 

We demonstrate the difference between PFI, CFI, and conditional SAGE 
value functions on a simulated example (Fig. 6) where the data-generating 
mechanism is known. While PFI only considers features to be relevant if they are 
actually used by the model, SAGE value functions may also consider a feature 
to be important that is not directly used by the model if it contains information 
that the model exploits. CFI only considers a feature to be relevant if it is both 
mechanistically used by the model and contributes unique information about Y. 

¹ While for CFI the conditional independence of the feature of interest $X_j$ with 
the target $Y$ given the remaining features $X_{-j}$ ($Y \perp X_j \mid X_{-j}$) is already a 
sufficient condition for zero importance, the corresponding PFI may still be 
nonzero [63]. 


Solution: When features are highly dependent and conditional effects and 
importance scores are used, the practitioner must be aware of the distinct 
interpretation. Recent work formalizes the implications of marginal and condi- 
tional interpretation techniques [21,25,56,61,63]. While marginal methods pro- 
vide insight into the model’s mechanism but are not true to the data, their 
conditional variants are not true to the model but provide insight into the asso- 
ciations in the data. 

If joint insight into model and data is required, designated methods must be 
used. ALE plots [3] provide interval-wise unconditional interpretations that are 
true to the data. They have been criticized for producing non-intuitive results for 
certain data-generating mechanisms [45]. Molnar et al. [84] propose a 
subgroup-based conditional sampling technique that allows for group-wise marginal 
interpretations that are true to model and data and that can be applied to 
feature importance and feature effect methods such as conditional PDPs and 
CFI. For feature importance, the DEDACT framework [61] allows decomposing 
conditional importance measures such as SAGE value functions into their 
marginal contributions and vice versa, thereby allowing global insight into both 
the sources of prediction-relevant information in the data and the 
feature pathways by which the information enters the model. 


Open Issues: The quality of conditional IML techniques depends on the good- 
ness of the conditional sampler. Especially in continuous, high-dimensional set- 
tings, conditional sampling is challenging. More research on the robustness of 
interpretation techniques regarding the quality of the sample is required. 


6 Misleading Interpretations Due to Feature Interactions 


6.1 Misleading Feature Effects Due to Aggregation 


Pitfall: Global interpretation methods, such as PDP or ALE plots, visualize 
the average effect of a feature on a model’s prediction. However, they can 
produce misleading interpretations when features interact. Figure 7 A and B show 
the marginal effect of features X1 and X2 of the below-stated simulation 
example. While the PDP of the non-interacting feature X1 seems to capture the 
true underlying effect of X1 on the target quite well (A), the global aggregated 
effect of the interacting feature X2 (B) shows almost no influence on the target, 
although an effect is clearly there by construction. 


52 C. Molnar et al. 


[Figure 7, panels A to D. Axis labels: Predicted y, Sd(deriv), Feature X1, Feature X2.] 

Fig. 7. Misleading effect due to interactions. Simulation example with 
interactions: $Y = 3X_1 - 6X_2 + 12X_2 \mathbb{1}_{(X_3 > 0)} + \epsilon$ with 
$X_1, X_2, X_3 \overset{i.i.d.}{\sim} U[-1, 1]$ and $\epsilon \overset{i.i.d.}{\sim} N(0, 0.3)$. 
A random forest with 500 trees is fitted on 1000 observations. Effects 
are calculated on 200 randomly sampled (training) observations. A, B: PDP (yellow) 
and ICE curves of X1 and X2; C: Derivative ICE curves and their standard deviation 
of X2; D: 2-dimensional PDP of X2 and X3. 


Solution: For the PDP, we recommend additionally considering the corresponding 
ICE curves [38]. While PDP and ALE average out interaction effects, ICE 
curves directly show the heterogeneity between individual predictions. Figure 7 
A illustrates that the individual marginal effect curves all follow an upward trend 
with only small variations. Hence, by aggregating these ICE curves to a global 
marginal effect curve such as the PDP, we do not lose much information. 
However, when the regarded feature interacts with other features, such as feature X2 
with feature X3 in this example, then marginal effect curves of different 
observations might not show similar effects on the target. Hence, ICE curves become 
very heterogeneous, as shown in Fig. 7 B. In this case, the influence of feature 
X2 is not well represented by the global average marginal effect. Particularly 
for continuous interactions where ICE curves start at different intercepts, we 
recommend the use of derivative or centered ICE curves, which eliminate 
differences in intercepts and leave only differences due to interactions [38]. Derivative 
ICE curves also point out the regions of highest interaction with other features. 
For example, Fig. 7 C indicates that predictions for X2 taking values close to 0 
strongly depend on other features’ values. While these methods show that 
interactions are present with regard to the feature of interest, they do not reveal the other 
features with which it interacts; the 2-dimensional PDP or ALE plot are options 
to visualize 2-way interaction effects. The 2-dimensional PDP in Fig. 7 D shows 
that predictions with regard to feature X2 highly depend on the feature values 
of feature X3. 
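
The relation between ICE curves, centered ICE curves, and the PDP can be sketched in a few lines. To keep the example self-contained, the noise-free data-generating function of Fig. 7 stands in for the fitted random forest; this substitution, the grid, and the function names are assumptions made for illustration.

```python
import numpy as np

def predict(X):
    """Noise-free function from the Fig. 7 simulation, used here in place of a fitted model."""
    return 3 * X[:, 0] - 6 * X[:, 1] + 12 * X[:, 1] * (X[:, 2] > 0)

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
grid = np.linspace(-1.0, 1.0, 21)

def ice_curves(predict, X, feature, grid):
    """One curve per observation: predictions with `feature` set to each grid value."""
    curves = np.empty((len(X), len(grid)))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value
        curves[:, k] = predict(X_mod)
    return curves

ice = ice_curves(predict, X, feature=1, grid=grid)
pdp = ice.mean(axis=0)            # the PDP is the pointwise average of the ICE curves
centered = ice - ice[:, [0]]      # centered ICE: every curve starts at 0
slopes = (ice[:, -1] - ice[:, 0]) / (grid[-1] - grid[0])

pdp_slope = (pdp[-1] - pdp[0]) / (grid[-1] - grid[0])
print(f"average slope of the PDP for X2: {pdp_slope:.2f}")
print(f"distinct ICE slopes for X2: {sorted(set(np.round(slopes, 2).tolist()))}")
```

The individual curves fall into two groups with opposite slopes, which the averaged PDP conceals almost entirely.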

Other methods that aim to gain more insights into these visualizations are 
based on clustering homogeneous ICE curves, such as visual interaction effects 
(VINE) [16] or [122]. As an example, in Fig. 7 B, it would be more meaningful to 
average over the upward and downward proceeding ICE curves separately and 
hence show that the average influence of feature X2 on the target depends on 
an interacting feature (here: X3). Work by Zon et al. [125] followed a similar 
idea by proposing an interactive visualization tool to group Shapley values with 
regard to interacting features that need to be defined by the user. 


Open Issues: The introduced visualization methods are not able to illustrate 
the type of the underlying interaction and most of them are also not applicable 
to higher-order interactions. 


6.2 Failing to Separate Main from Interaction Effects 


Pitfall: Many interpretation methods that quantify a feature’s importance or 
effect cannot separate an interaction from main effects. The PFI, for example, 
includes both the importance of a feature and the importance of all its interac- 
tions with other features [19]. Also local explanation methods such as LIME and 
Shapley values only provide additive explanations without separation of main 
effects and interactions [40]. 


Solution: Functional ANOVA introduced by [53] is probably the most popular 
approach to decompose the joint distribution into main and interaction effects. 
Using the same idea, the H-Statistic [35] quantifies the interaction strength 
between two features or between one feature and all others by decomposing 
the 2-dimensional PDP into its univariate components. The H-Statistic is based 
on the fact that, in the case of non-interacting features, the 2-dimensional par- 
tial dependence function equals the sum of the two underlying univariate par- 
tial dependence functions. Another similar interaction score based on partial 
dependencies is defined by [42]. Instead of decomposing the partial dependence 
function, [87] uses the predictive performance to measure interaction strength. 
Based on Shapley values, Lundberg et al. [77] proposed SHAP interaction val- 
ues, and Casalicchio et al. [19] proposed a fair attribution of the importance of 
interactions to the individual features. 

Furthermore, Hooker [54] considers dependent features and decomposes the 
predictions in main and interaction effects. A way to identify higher-order inter- 
actions is shown in [53]. 
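
The idea behind the pairwise H-statistic can be sketched as follows: if two features do not interact, their centered 2-dimensional partial dependence equals the sum of the two centered univariate partial dependence functions. The code is a simplified estimate evaluated on the data points themselves; the prediction function (the noise-free function of Fig. 7) and the helper names are assumptions for illustration.

```python
import numpy as np

def predict(X):
    """Noise-free function from the Fig. 7 simulation, used here in place of a fitted model."""
    return 3 * X[:, 0] - 6 * X[:, 1] + 12 * X[:, 1] * (X[:, 2] > 0)

def partial_dependence(predict, X, features, values):
    """Centered partial dependence of `features`, evaluated at each row of `values`."""
    pd_values = np.empty(len(values))
    for i, row in enumerate(values):
        X_mod = X.copy()
        X_mod[:, features] = row        # fix the selected features, keep the rest as observed
        pd_values[i] = predict(X_mod).mean()
    return pd_values - pd_values.mean()

def h_statistic(predict, X, j, k):
    """Squared H-statistic for the feature pair (j, k), estimated on the rows of X."""
    pd_jk = partial_dependence(predict, X, [j, k], X[:, [j, k]])
    pd_j = partial_dependence(predict, X, [j], X[:, [j]])
    pd_k = partial_dependence(predict, X, [k], X[:, [k]])
    return float(np.sum((pd_jk - pd_j - pd_k) ** 2) / np.sum(pd_jk ** 2))

rng = np.random.default_rng(4)
X = rng.uniform(-1.0, 1.0, size=(100, 3))
h_x2_x3 = h_statistic(predict, X, 1, 2)   # interacting pair
h_x1_x2 = h_statistic(predict, X, 0, 1)   # non-interacting pair
print(f"H^2(X2, X3) = {h_x2_x3:.3f}")
print(f"H^2(X1, X2) = {h_x1_x2:.3f}")
```

Each call requires on the order of n model evaluations on n rows, which already hints at the computational cost discussed in Sect. 9.2.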


Open Issues: Most methods that quantify interactions are not able to identify 
higher-order interactions and interactions of dependent features. Furthermore, 
the presented solutions usually lack automatic detection and ranking of all 
interactions of a model. Identifying a suitable shape or form of the modeled 
interaction is not straightforward as interactions can be very different and complex, 
e.g., they can be a simple product of features (multiplicative interaction) or can 
have a complex joint non-linear effect such as a smooth spline surface. 


7 Ignoring Model and Approximation Uncertainty 


Pitfall: Many interpretation methods only provide a mean estimate but do not 
quantify uncertainty. Both the model training and the computation of interpre- 
tation are subject to uncertainty. The model is trained on (random) data, and 
therefore should be regarded as a random variable. Similarly, LIME’s surrogate 
model relies on perturbed and reweighted samples of the data to approximate the 
prediction function locally [94]. Other interpretation methods are often defined 
in terms of expectations over the data (PFI, PDP, Shapley values, ...), but are 
approximated using Monte Carlo integration. Ignoring uncertainty can result in 
the interpretation of noise and non-robust results. The true effect of a feature 
may be flat, but — purely by chance, especially on smaller datasets — the Shap- 
ley value might show an effect. This effect could cancel out once averaged over 
multiple model fits. 
Fig. 8. Ignoring model and approximation uncertainty. PDP for X1 with 
$Y = 0 \cdot X_1 + \sum_{j=2}^{10} X_j + \epsilon_i$ with $X_1, \ldots, X_{10} \sim U[0, 1]$ and 
$\epsilon_i \sim N(0, 0.9)$. Left: PDP for X1 of a random forest trained on 100 data 
points. Middle: Multiple PDPs (10x) for the model from left plots, but with 
different samples (each n=100) for PDP estimation. Right: Repeated (10x) data 
samples of n=100 and newly fitted random forest. 


Figure 8 shows that a single PDP (first plot) can be misleading because it 
does not show the variance due to PDP estimation (second plot) and model 
fitting (third plot). If we are not interested in learning about a specific model, 
but rather about the relationship between feature X1 and the target (in this 
case), we should consider the model variance. 

Solution: By repeatedly computing PDP and PFI with a given model, but with 
different permutations or bootstrap samples, the uncertainty of the estimate 
can be quantified, for example in the form of confidence intervals. For PFI, 
frameworks for confidence intervals and hypothesis tests exist [2,117], but they 
assume a fixed model. If the practitioner wants to condition the analysis on the 
modeling process and capture the process’ variance instead of conditioning on a 
fixed model, PDP and PFI should be computed on multiple model fits [83]. 
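
Computing the PDP on multiple model fits can be sketched as below. The example loosely follows the setting of Fig. 8, with several assumptions made for brevity: a linear least squares model replaces the random forest, 0.9 is read as the noise variance, 30 refits are used, and the pointwise percentile band is only a simple summary of the spread, not a formal confidence band.

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_data(n, rng):
    """Setting of Fig. 8: X1 has no effect, X2 to X10 enter additively."""
    X = rng.uniform(0.0, 1.0, size=(n, 10))
    y = X[:, 1:].sum(axis=1) + rng.normal(scale=np.sqrt(0.9), size=n)
    return X, y

def fit_linear(X, y):
    """Ordinary least squares with intercept; returns a prediction function."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef

def pdp(predict, X, feature, grid):
    """Partial dependence: average prediction with `feature` fixed to each grid value."""
    values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        values.append(predict(X_mod).mean())
    return np.array(values)

grid = np.linspace(0.0, 1.0, 11)
curves = []
for _ in range(30):                       # new data sample and new model fit each time
    X, y = sample_data(100, rng)
    model = fit_linear(X, y)
    curve = pdp(model, X, feature=0, grid=grid)
    curves.append(curve - curve.mean())   # center curves to compare their shapes
curves = np.array(curves)

lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)
print("largest apparent effect of X1 in a single fit:", float(np.abs(curves[:, -1]).max()))
print("pointwise band at X1 = 1:", float(lower[-1]), float(upper[-1]))
```

Although X1 has no effect by construction, individual fits show a visible slope; only the spread across fits reveals that this slope is noise.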


Open Issues: While Moosbauer et al. [85] derived confidence bands for PDPs 
for probabilistic ML models that cover the model’s uncertainty, a general model- 
agnostic uncertainty measure for feature effect methods such as ALE [3] and PDP 
[36] has (to the best of our knowledge) not been introduced yet. 


8 Ignoring the Rashomon Effect 


Pitfall: Sometimes different models explain the data-generating process equally 
well, but contradict each other. This phenomenon is called the Rashomon effect, 
named after the movie “Rashomon” from the year 1950. Breiman formalized it 
for predictive models in 2001 [13]: Different prediction models might perform 
equally well (Rashomon set), but construct the prediction function in a different 
way (e.g. relying on different features). This can result in conflicting interpre- 
tations and conclusions about the data. Even small differences in the training 
data can cause one model to be preferred over another. 

For example, Dong and Rudin [29] identified a Rashomon set of equally well 
performing models for the COMPAS dataset. They showed that the models 
differed greatly in the importance they put on certain features. Specifically, if 
criminal history was identified as less important, race was more important and 
vice versa. Cherry-picking one model and its underlying explanation might not 
be sufficient to draw conclusions about the data-generating process. As Hancox- 
Li [48] states “just because race happens to be an unimportant variable in that 
one explanation does not mean that it is objectively an unimportant variable”. 

The Rashomon effect can also occur at the level of the interpretation method 
itself. Differing hyperparameters or interpretation goals can be one reason (see 
Sect. 2). But even if the hyperparameters are fixed, we could still obtain contra- 
dicting explanations by an interpretation method, e.g., due to a different data 
sample or initial seed. 

A concrete example of the Rashomon effect is counterfactual explanations. 
Different counterfactuals may all alter the prediction in the desired way, but 
point to different feature changes required for that change. If a person is deemed 
uncreditworthy, one corresponding counterfactual explaining this decision may 
point to a scenario in which the person had asked for a shorter loan duration 
and amount, while another counterfactual may point to a scenario in which 
the person had a higher income and more stable job. Focusing on only one 
counterfactual explanation in such cases strongly limits the possible epistemic 
access. 

Solution: If multiple, equally good models exist, their interpretations should 
be compared. Variable importance clouds [29] is a method for exploring variable 
importance scores for equally good models within one model class. If the interpre- 
tations are in conflict, conclusions must be drawn carefully. Domain experts or 
further constraints (e.g. fairness or sparsity) could help to pick a suitable model. 
Semenova et al. [102] also hypothesized that a large Rashomon set could contain 
simpler or more interpretable models, which should be preferred according to 
Sect. 4. 
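
A toy version of the Rashomon effect can be constructed with two nearly redundant features. In the sketch below, which is purely illustrative, two linear models are each restricted to one of the two features (`fit_on_columns` is a hypothetical helper); they reach almost the same error, but their permutation feature importances contradict each other.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
signal = rng.normal(size=n)
# Two noisy measurements of the same underlying signal.
X = np.column_stack([signal + 0.1 * rng.normal(size=n),
                     signal + 0.1 * rng.normal(size=n)])
y = signal + 0.5 * rng.normal(size=n)

def fit_on_columns(X, y, columns):
    """Least squares fit that only uses the given columns of X."""
    design = np.column_stack([np.ones(len(X)), X[:, columns]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z[:, columns]]) @ coef

def mse(predict, X, y):
    return float(np.mean((y - predict(X)) ** 2))

def pfi(predict, X, y, rng):
    """Permutation feature importance as increase in MSE, one permutation per feature."""
    baseline = mse(predict, X, y)
    scores = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X[:, j])
        scores.append(mse(predict, X_perm, y) - baseline)
    return np.array(scores)

model_a = fit_on_columns(X, y, [0])
model_b = fit_on_columns(X, y, [1])
mse_a, mse_b = mse(model_a, X, y), mse(model_b, X, y)
pfi_a, pfi_b = pfi(model_a, X, y, rng), pfi(model_b, X, y, rng)
print(f"MSE model A: {mse_a:.3f}, MSE model B: {mse_b:.3f}")
print("PFI model A:", np.round(pfi_a, 3))
print("PFI model B:", np.round(pfi_b, 3))
```

Inspecting only one of the two models would suggest that the other feature is irrelevant, which is a statement about the model and not about the data.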

In the case of counterfactual explanations, multiple, equally good explana- 
tions exist. Here, methods that return a set of explanations rather than a single 
one should be used — for example, the method by Dandl et al. [26] or Mothilal 
et al. [86]. 


Open Issues: Numerous very different counterfactual explanations are over- 
whelming for users. Methods for aggregating or combining explanations are still 
a matter of future research. 


9 Failure to Scale to High-Dimensional Settings 


9.1 Human-Intelligibility of High-Dimensional IML Output 


Pitfall: Applying IML methods naively to high-dimensional datasets (e.g. visu- 
alizing feature effects or computing importance scores on feature level) leads to 
an overwhelming and high-dimensional IML output, which impedes human anal- 
ysis. Especially interpretation methods that are based on visualizations make 
it difficult for practitioners in high-dimensional settings to focus on the most 
important insights. 


Solution: A natural approach is to reduce the dimensionality before applying 
any IML methods. Whether this facilitates understanding or not depends on 
the possible semantic interpretability of the resulting, reduced feature space — 
as features can either be selected or dimensionality can be reduced by linear 
or non-linear transformations. Assuming that users would like to interpret in 
the original feature space, many feature selection techniques can be used [46], 
resulting in much sparser and consequently easier-to-interpret models. 
Wrapper selection approaches are model-agnostic, and algorithms like greedy forward 
selection or subset selection procedures [5,60], which start from an empty model 
and iteratively add relevant (subsets of) features if needed, even allow measuring 
the relevance of features for predictive performance. An alternative is to directly 
use models that implicitly perform feature selection such as LASSO [112] or 
component-wise boosting [99] as they can produce sparse models with fewer fea- 
tures. In the case of LIME or other interpretation methods based on surrogate 
models, the aforementioned techniques could be applied to the surrogate model. 

When features can be meaningfully grouped in a data-driven or knowledge- 
driven way [51], applying IML methods directly to grouped features instead of 
single features is usually more time-efficient to compute and often leads to more 
appropriate interpretations. Examples where features can naturally be grouped 
include the grouping of sensor data [20], time-lagged features [75], or one-hot- 
encoded categorical features and interaction terms [43]. Before a model is fitted, 
groupings could already be exploited for dimensionality reduction, for example 
by selecting groups of features by the group LASSO [121]. 

For model interpretation, various papers extended feature importance meth- 
ods from single features to groups of features [5,43,114,119]. In the case of 
grouped PFI, this means that we perturb the entire group of features at once 
and measure the performance drop compared to the unperturbed dataset. Com- 
pared to standard PFI, the grouped PFI does not break the association to the 
other features of the group, but to features of other groups and the target. This is 
especially useful when features within the same group are highly correlated (e.g. 
time-lagged features), but between-group dependencies are rather low. Hence, 
this might also be a possible solution for the extrapolation pitfall described in 
Sect. 5.1. 
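
Grouped PFI as described above can be sketched as follows: all features of a group are permuted with the same row permutation, so that associations within the group stay intact. The simulated features, the least squares model, and the function name `grouped_pfi` are assumptions for illustration; importance is computed on the training data for brevity.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
lag1 = rng.normal(size=n)
lag2 = lag1 + 0.1 * rng.normal(size=n)   # highly correlated with lag1 (same group)
other = rng.normal(size=n)               # separate, independent group
X = np.column_stack([lag1, lag2, other])
y = lag1 + lag2 + other + 0.5 * rng.normal(size=n)

design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

def predict(Z):
    """Linear model fitted by least squares (stand-in for an arbitrary ML model)."""
    return np.column_stack([np.ones(len(Z)), Z]) @ coef

def grouped_pfi(predict, X, y, group, rng, n_repeats=10):
    """Mean increase in MSE when all columns in `group` are permuted jointly."""
    baseline = np.mean((y - predict(X)) ** 2)
    increases = []
    for _ in range(n_repeats):
        perm = rng.permutation(len(X))
        X_perm = X.copy()
        X_perm[:, group] = X[perm][:, group]   # same row permutation for the whole group
        increases.append(np.mean((y - predict(X_perm)) ** 2) - baseline)
    return float(np.mean(increases))

importance_lags = grouped_pfi(predict, X, y, [0, 1], rng)
importance_other = grouped_pfi(predict, X, y, [2], rng)
print(f"grouped PFI, lag features: {importance_lags:.3f}")
print(f"grouped PFI, other feature: {importance_other:.3f}")
```

Because the two lag features are moved together, the permuted rows remain realistic with respect to the within-group dependence.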

We consider the PhoneStudy in [106] as an illustration. The PhoneStudy 
dataset contains 1821 features to analyze the link between human behavior based 
on smartphone data and participants’ personalities. Interpreting the results in 
this use case seems to be challenging since features were dependent and single 
feature effects were either small or non-linear [106]. The features have been 
grouped in behavior-specific categories such as app-usage, music consumption, 
or overall phone usage. Au et al. [5] calculated various grouped importance 
scores on the feature groups to measure their influence on a specific personality 
trait (e.g. conscientiousness). Furthermore, the authors applied a greedy forward 
subset selection procedure via repeated subsampling on the feature groups and 
showed that combining app-usage features and overall phone usage features was 
sufficient for the given prediction task most of the time. 


Open Issues: The quality of a grouping-based interpretation strongly depends 
on the human intelligibility and meaningfulness of the grouping. If the grouping 
structure is not naturally given, then data-driven methods can be used. However, 
if feature groups are not meaningful (e.g. if they cannot be described by a super- 
feature such as app-usage), then subsequent interpretations of these groups are 
purposeless. One solution could be to combine feature selection strategies with 
interpretation methods. For example, LIME’s surrogate model could be a LASSO 
model. However, beyond surrogate models, the integration of feature selection 
strategies remains an open issue that requires further research. 

Existing research on grouped interpretation methods mainly focused on quan- 
tifying grouped feature importance, but the question of “how a group of fea- 
tures influences a model’s prediction” remains almost unanswered. Only recently, 
[5,15,101] attempted to answer this question by using dimension-reduction tech- 
niques (such as PCA) before applying the interpretation method. However, this 
is also a matter of further research. 

9.2 Computational Effort 


Pitfall: Some interpretation methods do not scale linearly with the number of 
features. For example, for the computation of exact Shapley values the number 
of possible coalitions [25,78], or for a (full) functional ANOVA decomposition 
the number of components (main effects plus all interactions) scales with 
$O(2^p)$ [54].² 


Solution: For the functional ANOVA, a common solution is to keep the analysis 
to the main effects and selected 2-way interactions (similar for PDP and ALE). 
Interesting 2-way interactions can be selected by another method such as the 
H-statistic [35]. However, the selection of 2-way interactions requires additional 
computational effort. Interaction strength usually decreases quickly with increas- 
ing interaction size, and one should only consider d-way interactions when all 
their (d-1)-way interactions were significant [53]. For Shapley-based methods, an 
efficient approximation exists that is based on randomly sampling and 
evaluating feature orderings until the estimates converge. The variance of the estimates 
reduces in $O(1/m)$, where m is the number of evaluated orderings [25,78]. 


9.3 Ignoring Multiple Comparison Problem 


Pitfall: Simultaneously testing the importance of multiple features will result 
in false-positive interpretations if the multiple comparisons problem (MCP) is 
ignored. The MCP is well known in significance tests for linear models and 
exists similarly in testing for feature importance in ML. For example, suppose 
we simultaneously test the importance of 50 features (with the H0-hypothesis 
of zero importance) at the significance level α = 0.05. Even if all features are 
unimportant, the probability of observing that at least one feature is significantly 
important is $1 - P(\text{no feature important}) = 1 - (1 - 0.05)^{50} \approx 0.923$. Multiple 
comparisons become even more problematic the higher the dimension of the 
dataset. 


Solution: Methods such as Model-X knockoffs [17] directly control for the false 
discovery rate (FDR). For all other methods that provide p-values or confidence 
intervals, such as PIMP (Permutation IMPortance) [2], which is a testing 
approach for PFI, MCP is often ignored in practice to the best of our knowledge, 
with some exceptions [105,117]. One of the most popular MCP adjustment 
methods is the Bonferroni correction [31], which rejects a null hypothesis if its p-value 
is smaller than α/p, with p as the number of tests. It has the disadvantage that 
it increases the probability of false negatives [90]. Since MCP is well known 
in statistics, we refer the practitioner to [28] for an overview and discussion of 
alternative adjustment methods, such as the Bonferroni-Holm method [52]. 
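
The numbers above and the two adjustment methods can be reproduced with a short sketch. The p-values in the example are made up for illustration, the function names are hypothetical, and the family-wise error rate formula assumes independent tests.

```python
import numpy as np

def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_reject(p_values, alpha=0.05):
    """Reject every hypothesis whose p-value is smaller than alpha / number of tests."""
    p_values = np.asarray(p_values, dtype=float)
    return p_values < alpha / len(p_values)

def holm_reject(p_values, alpha=0.05):
    """Bonferroni-Holm step-down procedure."""
    p_values = np.asarray(p_values, dtype=float)
    m = len(p_values)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p_values)):
        if p_values[idx] < alpha / (m - rank):
            reject[idx] = True
        else:
            break                      # stop at the first non-rejected hypothesis
    return reject

fwer = familywise_error_rate(0.05, 50)
p_values = [0.001, 0.011, 0.02, 0.04, 0.3]   # made-up p-values of five features
n_uncorrected = int(np.sum(np.asarray(p_values) < 0.05))
n_bonferroni = int(bonferroni_reject(p_values).sum())
n_holm = int(holm_reject(p_values).sum())
print(f"FWER for 50 tests at alpha = 0.05: {fwer:.3f}")
print(f"rejections: uncorrected {n_uncorrected}, Bonferroni {n_bonferroni}, Holm {n_holm}")
```

In this example the Holm procedure rejects one hypothesis more than the Bonferroni correction, which reflects that it is never more conservative.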


? Similar to the PDP or ALE plots, the functional ANOVA components describe 
individual feature effects and interactions. 
Fig. 9. Failure to scale to high-dimensional settings. Comparison of the number
of features with significant importance, once with and once without Bonferroni-corrected
significance levels, for a varying number of added noise variables. Datasets
were sampled from Y = 2X_1 + 2X_2^2 + ε with X_1, X_2, ε ~ N(0,1). X_3, X_4, ..., X_p ~
N(0,1) are additional noise variables with p ranging between 2 and 1000. For each p,
we sampled two datasets from this data-generating process: one to train a random
forest with 500 trees on and one to test whether feature importances differed from 0
using PIMP. In all experiments, X_1 and X_2 were correctly identified as important.


As an example, in Fig. 9 we compare the number of features with significant 
importance measured by PIMP once with and once without Bonferroni-adjusted 
significance levels (α = 0.05 vs. α = 0.05/p). Without correcting for multiple
comparisons, the number of features mistakenly evaluated as important grows
considerably with increasing dimension, whereas Bonferroni correction results in 
only a modest increase. 


10 Unjustified Causal Interpretation 


Pitfall: Practitioners are often interested in causal insights into the underly- 
ing data-generating mechanisms, which IML methods do not generally provide. 
Common causal questions include the identification of causes and effects, pre- 
dicting the effects of interventions, and answering counterfactual questions [88]. 
For example, a medical researcher might want to identify risk factors or predict 
average and individual treatment effects [66]. In search of answers, a researcher 
can therefore be tempted to interpret the result of IML methods from a causal 
perspective. 

However, a causal interpretation of predictive models is often not possible. 
Standard supervised ML models are not designed to model causal relationships 
but to merely exploit associations. A model may therefore rely on causes and 
effects of the target variable as well as on variables that help to reconstruct 
unobserved influences on Y, e.g. causes of effects [118]. Consequently, the ques- 
tion of whether a variable is relevant to a predictive model (indicated e.g. by 
PFI > 0) does not directly indicate whether a variable is a cause, an effect, 
or does not stand in any causal relation to the target variable. Furthermore, 
even if a model relied solely on direct causes for the prediction, the causal
structure between features must be taken into account. Intervening on a variable 
in the real world may affect not only Y but also other variables in the feature 
set. Without assumptions about the underlying causal structure, IML methods 
cannot account for these adaptions and guide action [58,62]. 

As an example, we constructed a dataset by sampling from a structural causal 
model (SCM), for which the corresponding causal graph is depicted in Fig. 10. All 
relationships are linear Gaussian with variance 1 and coefficients 1. For a linear 
model fitted on the dataset, all features were considered to be relevant based 
on the model coefficients (ŷ = 0.329x_1 + 0.323x_2 - 0.327x_3 + 0.342x_4 + 0.334x_5,
R² = 0.943), although x_3, x_4 and x_5 do not cause Y.
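The effect can be reproduced with a much smaller structural causal model than the one in Fig. 10. The sketch below does not use the graph from the figure. It assumes a hypothetical chain X1 → Y → X2 with unit coefficients and standard Gaussian noise, and fits an ordinary least squares model of Y on both features using only the standard library.

```python
import random

def simulate(n=20000, seed=0):
    """Sample from a hypothetical SCM: X1 causes Y, and Y causes X2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)       # cause of Y
        y = x1 + rng.gauss(0, 1)   # Y := X1 + noise
        x2 = y + rng.gauss(0, 1)   # X2 := Y + noise (an effect of Y)
        rows.append((x1, x2, y))
    return rows

def ols_two_features(rows):
    """Least-squares slopes of y on (x1, x2), via the centered normal equations."""
    n = len(rows)
    m1 = sum(r[0] for r in rows) / n
    m2 = sum(r[1] for r in rows) / n
    my = sum(r[2] for r in rows) / n
    s11 = sum((r[0] - m1) ** 2 for r in rows)
    s22 = sum((r[1] - m2) ** 2 for r in rows)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in rows)
    s1y = sum((r[0] - m1) * (r[2] - my) for r in rows)
    s2y = sum((r[1] - m2) * (r[2] - my) for r in rows)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

b1, b2 = ols_two_features(simulate())
print(b1, b2)
```

Under these assumptions both population coefficients equal 0.5, so the effect X2 receives the same weight as the cause X1. The model coefficients alone therefore do not reveal which feature is a cause of Y.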


Solution: The practitioner must carefully assess whether sufficient assumptions 
can be made about the underlying data-generating process, the learned model, 
and the interpretation technique. If these assumptions are met, a causal inter- 
pretation may be possible. The PDP between a feature and the target can be 
interpreted as the respective average causal effect if the model performs well and 
the set of remaining variables is a valid adjustment set [123]. When it is known
whether a model is deployed in a causal or anti-causal setting (i.e. whether
the model attempts to predict an effect from its causes or the other way round),
a partial identification of the causal roles based on feature relevance is possible
(under strong and non-testable assumptions) [118]. Designated tools and
approaches are available for causal discovery and inference [91].


Open Issues: The challenge of causal discovery and inference remains an open
key issue in the field of ML. Careful research is required to make explicit which
insights about the underlying data-generating mechanism can be gained by
interpreting an ML model, and under which assumptions.


Fig. 10. Causal graph 


11 Discussion 


In this paper, we have reviewed numerous pitfalls of local and global model- 
agnostic interpretation techniques, e.g. in the case of bad model generalization, 
dependent features, interactions between features, or causal interpretations. We 
have not attempted to provide an exhaustive list of all potential pitfalls in ML 
model interpretation, but have instead focused on common pitfalls that apply
to various model-agnostic IML methods and pose a particularly high risk. 

We have omitted pitfalls that are more specific to one IML method type: 
For local methods, the vague notions of neighborhood and distance can lead to 
misinterpretations [68,69], and common distance metrics (such as the Euclidean 
distance) are prone to the curse of dimensionality [1]; surrogate methods such
as LIME may not be entirely faithful to the original model they replace in 
interpretation. Moreover, we have not addressed pitfalls associated with certain 
data types (like the definition of superpixels in image data [98]), nor those related 
to human cognitive biases (e.g. the illusion of model understanding [22]). 

Many pitfalls in the paper are strongly linked with axioms that encode 
desiderata of model interpretation. For example, pitfall Sect. 5.3 (misunderstand- 
ing conditional interpretations) is related to violations of sensitivity [56,110]. As 
such, axioms can help to make the strengths and limitations of methods explicit. 
Therefore, we encourage an axiomatic evaluation of interpretation methods. 

We hope to promote a more cautious approach when interpreting ML models 
in practice, to point practitioners to already (partially) available solutions, and 
to stimulate further research on these issues. The stakes are high: ML algorithms 
are increasingly used for socially relevant decisions, and model interpretations 
play an important role in every empirical science. Therefore, we believe that 
users can benefit from concrete guidance on properties, dangers, and problems 
of IML techniques — especially as the field is advancing at high speed. We need 
to strive towards a recommended, well-understood set of tools, which will in turn 
require much more careful research. This especially concerns the meta-issues of 
comparisons of IML techniques, IML diagnostic tools to warn against mislead- 
ing interpretations, and tools for analyzing multiple dependent or interacting 
features. 
Abstract. Providing explanations in the context of Visual Question Answering
(VQA) presents a fundamental problem in machine learning. To obtain detailed 
insights into the process of generating natural language explanations for VQA, 
we introduce the large-scale CLEVR-X dataset that extends the CLEVR dataset 
with natural language explanations. For each image-question pair in the CLEVR 
dataset, CLEVR-X contains multiple structured textual explanations which are 
derived from the original scene graphs. By construction, the CLEVR-X explana- 
tions are correct and describe the reasoning and visual information that is neces- 
sary to answer a given question. We conducted a user study to confirm that the 
ground-truth explanations in our proposed dataset are indeed complete and rel- 
evant. We present baseline results for generating natural language explanations 
in the context of VQA using two state-of-the-art frameworks on the CLEVR-X 
dataset. Furthermore, we provide a detailed analysis of the explanation genera- 
tion quality for different question and answer types. Additionally, we study the 
influence of using different numbers of ground-truth explanations on the conver- 
gence of natural language generation (NLG) metrics. The CLEVR-X dataset is 
publicly available at https://github.com/ExplainableML/CLEVR-X. 


Keywords: Visual question answering · Natural language explanations


1 Introduction 


Explanations for automatic decisions form a crucial step towards increasing trans- 
parency and human trust in deep learning systems. In this work, we focus on natural 
language explanations in the context of vision-language tasks. 

In particular, we consider the vision-language task of Visual Question Answering 
(VQA) which consists of answering a question about an image. This requires multiple 
skills, such as visual perception, text understanding, and cross-modal reasoning in the 
visual and language domains. A natural language explanation for a given answer allows 
a better understanding of the reasoning process for answering the question and adds 
transparency. However, it is challenging to formulate what comprises a good textual 
explanation in the context of VQA involving natural images. 


© The Author(s) 2022 
A. Holzinger et al. (Eds.): xxAI 2020, LNAI 13200, pp. 69-88, 2022. 
https://doi.org/10.1007/978-3-031-04083-2_5


70 L. Salewski et al. 


VQA-X
Question: Does this scene look like it could be from the early 1950s?
Answer | Explanation: Yes | The photo is in black and white and the cars are all classic designs from the 1950s

e-SNLI-VE
Hypothesis: A woman is holding a child.
Answer | Explanation: Entailment | If a woman holds a child she is holding a child.

CLEVR-X
Question: There is a purple metallic ball; what number of cyan objects are right of it?
Answer | Explanation: 1 | There is a cyan cylinder which is on the right side of the purple metallic ball.


Fig. 1. Comparing examples from the VQA-X (left), e-SNLI-VE (middle), and CLEVR-X (right) 
datasets. The explanation in VQA-X requires prior knowledge (about cars from the 1950s), e- 
SNLI-VE argues with a tautology, and our CLEVR-X only uses abstract visual reasoning. 


Explanation datasets commonly used in the context of VQA, such as the VQA-X 
dataset [26] or the e-SNLI-VE dataset [13,29] for visual entailment, contain expla- 
nations of widely varying quality since they are generated by humans. The ground- 
truth explanations in VQA-X and e-SNLI-VE can range from statements that merely 
describe an image to explaining the reasoning about the question and image involving 
prior information, such as common knowledge. One example of a ground-truth explanation
in VQA-X that requires prior knowledge about car designs from the 1950s can be
seen in Fig. 1. The e-SNLI-VE dataset contains numerous explanation samples which 
consist of repeated statements (“x because x”). Since existing explanation datasets for 
vision-language tasks contain immensely varied explanations, it is challenging to per- 
form a structured analysis of strengths and weaknesses of existing explanation genera- 
tion methods. 

In order to fill this gap, we propose the novel, diagnostic CLEVR-X dataset 
for visual reasoning with natural language explanations. It extends the synthetic 
CLEVR [27] dataset through the addition of structured natural language explanations 
for each question-image pair. An example of our proposed CLEVR-X dataset is shown
in Fig. 1. The synthetic nature of the CLEVR-X dataset results in several advantages 
over datasets that use human explanations. Since the explanations are synthetically 
constructed from the underlying scene graph, the explanations are correct and do not 
require auxiliary prior knowledge. The synthetic textual explanations do not suffer from 
errors that get introduced with human explanations. Nevertheless, the explanations in 
the CLEVR-X dataset are human parsable as demonstrated in the human user study that 
we conducted. Furthermore, the explanations contain all the information that is neces- 
sary to answer a given question about an image without seeing the image. This means 
that the explanations are complete with respect to the question about the image. 

The CLEVR-X dataset allows for detailed diagnostics of natural language expla- 
nation generation methods in the context of VQA. For instance, it contains a wider 
range of question types than other related datasets. We provide baseline performances 
on the CLEVR-X dataset using recent frameworks for natural language explanations in
the context of VQA. Those frameworks are jointly trained to answer the question and 
provide a textual explanation. Since the question family, question complexity (number
of reasoning steps required), and the answer type (binary, counting, attributes) are
known for each question and answer, the results can be analyzed and split according to
these groups. In particular, the challenging counting problem [48], which is not well- 
represented in the VQA-X dataset, can be studied in detail on CLEVR-X. Furthermore, 
our dataset contains multiple ground-truth explanations for each image-question pair. 
These capture a large portion of the space of correct explanations which allows for a 
thorough analysis of the influence of the number of ground-truth explanations used on 
the evaluation metrics. Our approach of constructing textual explanations from a scene 
graph yields a great resource which could be extended to other datasets that are based 
on scene graphs, such as the CLEVR-CoGenT dataset. 

To summarize, we make the following four contributions: (1) We introduce the 
CLEVR-X dataset with natural language explanations for Visual Question Answering; 
(2) We confirm that the CLEVR-X dataset consists of correct explanations that con- 
tain sufficient relevant information to answer a posed question by conducting a user 
study; (3) We provide baseline performances with two state-of-the-art methods that 
were proposed for generating textual explanations in the context of VQA; (4) We use the 
CLEVR-X dataset for a detailed analysis of the explanation generation performance for 
different subsets of the dataset and to better understand the metrics used for evaluation. 


2 Related Work 


In this section, we discuss several themes in the literature that relate to our work, namely 
Visual Question Answering, Natural language explanations (for vision-language tasks), 
and the CLEVR dataset. 


Visual Question Answering (VQA). The VQA [5] task has been addressed by several 
works that apply attention mechanisms to text and image features [16,45,55,56, 60]. 
However, recent works observed that the question-answer bias in common VQA datasets 
can be exploited in order to answer questions without leveraging any visual informa- 
tion [1,2,27,59]. This has been further investigated in more controlled dataset settings, 
such as the CLEVR [27], VQA-CP [2], and GQA [25] datasets. In addition to a controlled 
dataset setting, our proposed CLEVR-X dataset contains natural language explanations 
that enable a more detailed analysis of the reasoning in the context of VQA. 


Natural Language Explanations. Decisions made by neural networks can be visually 
explained with visual attribution that is determined by introspecting trained networks 
and their features [8,43,46,57,58], by using input perturbations [14,15,42], or by 
training a probabilistic feature attribution model along with a task-specific CNN [30]. 
Complementary to visual explanation methods that tend to not help users distinguish
between correct and incorrect predictions [32], natural language explanations
have been investigated for a variety of tasks, such as fine-grained visual object clas- 
sification [20,21], or self-driving car models [31]. The requirement to ground lan- 
guage explanations in the input image can prevent shortcuts, such as relying on dataset 
statistics or referring to instance attributes that are not present in the image. For a 
comprehensive overview of research on explainability and interpretability, we refer to
recent surveys [7, 10, 17]. 


Natural Language Explanations for Vision-Language Tasks. Multiple datasets for 
natural language explanations in the context of vision-language tasks have been pro- 
posed, such as the VQA-X [26], VQA-E [35], and e-SNLI-VE datasets [29]. VQA- 
X [26] augments a small subset of the VQA v2 [18] dataset for the Visual Question 
Answering task with human explanations. Similarly, the VQA-E dataset [35] extends 
the VQA v2 dataset by sourcing explanations from image captions. However, the VQA- 
E explanations resemble image descriptions and do not provide satisfactory justifi- 
cations whenever prior knowledge is required [35]. The e-SNLI-VE [13,29] dataset 
combines human explanations from e-SNLI [11] and the image-sentence pairs for the 
Visual Entailment task from SNLI-VE [54]. In contrast to the VQA-E, VQA-X, and 
e-SNLI-VE datasets which consist of human explanations or image captions, our pro- 
posed dataset contains systematically constructed explanations derived from the asso- 
ciated scene graphs. Recently, several works have aimed at generating natural language 
explanations for vision-language tasks [26,29,38,40,52,53]. In particular, we use the 
PJ-X [26] and FM [53] frameworks to obtain baseline results on our proposed CLEVR- 
X dataset. 


The CLEVR Dataset. The CLEVR dataset [27] was proposed as a diagnostic dataset 
to inspect the visual reasoning of VQA models. Multiple frameworks have been pro- 
posed to address the CLEVR task [23, 24, 28,41,44,47]. To add explainability, the XNM 
model [44] adopts the scene graph as an inductive bias which enables the visualization 
of the reasoning based on the attention on the nodes of the graph. There have been 
numerous dataset extensions for the CLEVR dataset, for instance to measure the gen- 
eralization capabilities of models pre-trained on CLEVR (CLOSURE [51]), to evaluate 
object detection and segmentation (CLEVR-Ref+ [37]), or to benchmark visual dia- 
log models (CLEVR dialog [34]). The Compositional Reasoning Under Uncertainty 
(CURI) benchmark uses the CLEVR renderer to construct a test bed for compositional 
and relational learning under uncertainty [49]. [22] provide an extensive survey of fur- 
ther experimental diagnostic benchmarks for analyzing explainable machine learning 
frameworks along with proposing the KandinskyPATTERNS benchmark that contains 
synthetic images with simple 2-dimensional objects. It can be used for testing the qual- 
ity of explanations and concept learning. Additionally, [6] proposed the CLEVR-XAI- 
simple and CLEVR-XAI-complex datasets which provide ground-truth segmentation 
information for heatmap-based visual explanations. Our CLEVR-X augments the exist- 
ing CLEVR dataset with explanations, but in contrast to (heatmap-based) visual expla- 
nations, we focus on natural language explanations. 


3 The CLEVR-X Dataset 


In this section, we introduce the CLEVR-X dataset that consists of natural language 
explanations in the context of VQA. The CLEVR-X dataset extends the CLEVR 
dataset with 3.6 million natural language explanations for 850k question-image pairs. 
In Sect. 3.1, we briefly describe the CLEVR dataset, which forms the base for our pro- 
posed dataset. Next, we present an overview of the CLEVR-X dataset by describing 
how the natural language explanations were obtained in Sect. 3.2, and by providing a
comprehensive analysis of the CLEVR-X dataset in Sect. 3.3. Finally, in Sect. 3.4, we 
present results for a user study on the CLEVR-X dataset. 


3.1 The CLEVR Dataset 


The CLEVR dataset consists of images with corresponding full scene graph annota- 
tions which contain information about all objects in a given scene (as nodes in the 
graph) along with spatial relationships for all object pairs. The synthetic images in the 
CLEVR dataset contain three to ten (at least partially visible) objects in each scene, 
where each object has the four distinct properties size, color, material, and 
shape. There are three shapes (box, sphere, cylinder), eight colors (gray, red, 
blue, green, brown, purple, cyan, yellow), two sizes (large, small), and
two materials (rubber, metallic). This allows for up to 96 different combinations 
of properties. 
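The count of 96 follows directly from the property values listed above. A short check (the constant names are chosen here for illustration):

```python
from itertools import product

SHAPES = ("box", "sphere", "cylinder")
COLORS = ("gray", "red", "blue", "green", "brown", "purple", "cyan", "yellow")
SIZES = ("large", "small")
MATERIALS = ("rubber", "metallic")

# Every object is described by one value per property.
combinations = list(product(SIZES, COLORS, MATERIALS, SHAPES))
print(len(combinations))  # 2 * 8 * 2 * 3 = 96
print(combinations[0])
```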

There are a total of 90 different question families in the dataset which are grouped 
into 9 different question types. Each type contains questions from between 5 and 28 
question families. In the following, we describe the 9 question types in more detail. 


Hop Questions: The zero hop, one hop, two hop, and three hop question types contain 
up to three relational reasoning steps, e.g. “What color is the cube to the left of the 
ball?” is a one hop question. 


Compare and Relate Questions: The compare integer, same relate, and comparison 
question types require the understanding and comparison of multiple objects in a scene. 
Questions of the compare integer type compare counts corresponding to two indepen- 
dent clauses (e.g. “Are there more cubes than red balls?”). Same relate questions reason 
about objects that have the same attribute as another previously specified object (e.g. 
“What is the color of the cube that has the same size as the ball?”). In contrast, compar- 
ison question types compare the attributes of two objects (e.g. “Is the color of the cube 
the same as the ball?”). 


Single and/or Questions: Single or questions identify objects that satisfy an exclusive 
disjunction condition (e.g. “How many objects are either red or blue?”). Similarly, sin- 
gle and questions apply multiple relations and filters to find an object that satisfies all 
conditions (e.g. “How many objects are red and to the left of the cube?”).


Each CLEVR question can be represented by a corresponding functional program 
and its natural language realization. A functional program is composed of basic func- 
tions that resemble elementary visual reasoning operations, such as filtering objects by 
one or more properties, relating objects to each other, or querying object properties. 
Furthermore, logical operations like and and or, as well as counting operations like 
count, less, more, and equal are used to build complex questions. Executing the func- 
tional program associated with the question against the scene graph yields the correct 
answer to the question. We can distinguish between three different answer types: Binary 
answers (yes or no), counting answers (integers from 0 to 10), and attribute answers 
(any of the possible values of shape, color, size, or material).
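To make the notion of a functional program concrete, the sketch below executes three tiny programs against a hand-written scene. It is a simplification under stated assumptions: the scene, the function names, and the representation of a program as a linear chain of steps are hypothetical, whereas actual CLEVR programs are trees and also cover spatial relations.

```python
# Hypothetical scene graph without spatial relations: one dict per object.
SCENE = [
    {"size": "large", "color": "red", "material": "rubber", "shape": "cube"},
    {"size": "small", "color": "red", "material": "metallic", "shape": "sphere"},
    {"size": "small", "color": "blue", "material": "rubber", "shape": "cube"},
]

def filter_by(objects, prop, value):
    return [obj for obj in objects if obj[prop] == value]

def unique(objects):
    if len(objects) != 1:
        raise ValueError("expected exactly one object")
    return objects[0]

def query(obj, prop):
    return obj[prop]

def count(objects):
    return len(objects)

def exist(objects):
    return len(objects) > 0

FUNCTIONS = {"filter": filter_by, "unique": unique, "query": query,
             "count": count, "exist": exist}

def execute(program, scene):
    """Run a chain of (function name, *arguments) steps on the scene."""
    state = list(scene)
    for name, *args in program:
        state = FUNCTIONS[name](state, *args)
    return state

how_many_red = [("filter", "color", "red"), ("count",)]
what_shape_is_blue = [("filter", "color", "blue"), ("unique",), ("query", "shape")]
is_there_metal_cube = [("filter", "material", "metallic"), ("filter", "shape", "cube"), ("exist",)]

print(execute(how_many_red, SCENE))         # counting answer
print(execute(what_shape_is_blue, SCENE))   # attribute answer
print(execute(is_there_metal_cube, SCENE))  # binary answer
```

The three programs produce one answer of each type described above: a count, an attribute value, and a yes/no result.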
[Figure 2 panels: Image · Scene graph · Tracing the functional program · Explanation generation]
Question: What number of other objects are there of the same material as the tiny thing?
Template: There <verb2> a <obj2> {that, which} have the {same, identical} <attribute> as the <obj1>.
Explanation: There are a large yellow metallic cube and cylinder that have the same material as the tiny sphere.


Fig. 2. CLEVR-X dataset generation: Generating a natural language explanation for a sample 
from the CLEVR dataset. Based on the question, the functional program for answering the ques- 
tion is executed on the scene graph and traced. A language template is used to cast the gathered 
information into a natural language explanation. 


3.2 Dataset Generation 


Here, we describe the process for generating natural language explanations for the 
CLEVR-X dataset. In contrast to image captions, the CLEVR-X explanations only 
describe image elements that are relevant to a specific input question. The explanation 
generation process for a given question-image pair is illustrated in Fig. 2. It consists of 
three steps: Tracing the functional program, relevance filtering (not shown in the figure), 
and explanation generation. In the following, we will describe those steps in detail. 


Tracing the Functional Program. Given a question-image pair from the CLEVR 
dataset, we trace the execution of the functional program (that corresponds to the ques- 
tion) on the scene graph (which is associated with the image). The generation of the 
CLEVR dataset uses the same step to obtain a question-answer pair. When executing 
the basic functions that comprise the functional program, we record their outputs in 
order to collect all the information required for explaining a ground-truth answer. 

In particular, we trace the filter, relate and same-property functions and record the 
returned objects and their properties, such as shape, size etc. As a result, the tracing 
omits objects in the scene that are not relevant for the question. As we are aiming for 
complete explanations for all question types, each explanation has to mention all the 
objects that were needed to answer the question, i.e. all the evidence that was obtained 
during tracing. For example, for counting questions, all objects that match the filter 
function preceding the counting step are recorded during tracing. For and questions, we 
merge the tracing results of the preceding functions which results in short and readable 
explanations. In summary, the tracing produces a complete and correct understanding 
of the objects and relevant properties which contributed to an answer. 
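The tracing step can be sketched as follows for the question from Fig. 2 (“What number of other objects are there of the same material as the tiny thing?”). This is an illustrative simplification, not the released generation code: the scene, the object ids, and all function names are assumptions, and the program is again treated as a linear chain.

```python
# Hypothetical scene for the Fig. 2 question; ids and the gray color are made up.
SCENE = [
    {"id": 0, "size": "small", "color": "gray", "material": "metallic", "shape": "sphere"},
    {"id": 1, "size": "large", "color": "yellow", "material": "metallic", "shape": "cube"},
    {"id": 2, "size": "large", "color": "yellow", "material": "metallic", "shape": "cylinder"},
    {"id": 3, "size": "large", "color": "red", "material": "rubber", "shape": "cube"},
]

def filter_size(scene, objects, size):
    return [obj for obj in objects if obj["size"] == size]

def same_material(scene, objects):
    reference = objects[0]  # assumes the preceding filter returned a unique object
    return [obj for obj in scene
            if obj["material"] == reference["material"] and obj["id"] != reference["id"]]

def count(scene, objects):
    return len(objects)

FUNCTIONS = {"filter_size": filter_size, "same_material": same_material, "count": count}
TRACED = {"filter_size", "same_material"}  # only these outputs are recorded

def execute_with_trace(program, scene):
    """Execute the program and record the outputs of the traced functions."""
    state, trace = list(scene), []
    for name, *args in program:
        state = FUNCTIONS[name](scene, state, *args)
        if name in TRACED:
            trace.append((name, state))
    return state, trace

program = [("filter_size", "small"), ("same_material",), ("count",)]
answer, trace = execute_with_trace(program, SCENE)
print(answer)
for name, objects in trace:
    print(name, [obj["shape"] for obj in objects])
```

The trace contains the tiny sphere and the two metallic objects that share its material. The rubber cube never appears in it, which mirrors how tracing omits objects that are irrelevant to the question.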


Relevance Filtering. To keep the explanation at a reasonable length, we filter the object 
attributes that are mentioned in the explanation according to their relevance. For exam- 
ple, the color of an object is not relevant for a given question that asks about the 
material of said object. We deem all properties that were listed in the question to 
be relevant. This makes it easier to recognize the same referenced object in both the 
question and explanation. As the shape property also serves as a noun in CLEVR, our 
explanations always mention the shape to avoid using generic shape descriptions like 
“object” or “thing”. We distinguish between objects which are used to build the question
(e.g. “[...] that is left of the cube?”) and those that are the subject of the posed
question (e.g. “What color is the sphere that is left of the cube?”). For the former, we 
do not mention any additional properties, and for the latter, we mention the queried 
property (e.g. color) for question types yielding attribute answers. 


Explanation Generation. To obtain the final natural language explanations, each ques- 
tion type is equipped with one or more natural language templates with variations in 
terms of the wording used. Each template contains placeholders which are filled with 
the output of the previous steps, i.e. the tracing of the functional program and subse- 
quent filtering for relevance. As mentioned above, our explanations use the same prop- 
erty descriptions that appeared in the question. This is done to ensure that the wording 
of the explanation is consistent with the given question, e.g. for the question “Is there 
a small object?” we generate the explanation “Yes there is a small cube.”¹. We randomly
sample synonyms for describing the properties of objects that do not appear in
the question. If multiple objects are mentioned in the explanation, we randomize their 
order. If the tracing step returned an empty set, e.g. if no object exists that matches the 
given filtering function for an existence or counting question, we state that no relevant 
object is contained in the scene (e.g. “There is no red cube.”).

In order to decrease the overall sentence length and to increase the readability, 
we aggregate repetitive descriptions (e.g. “There is a red cube and a red cube”) using 
numerals (e.g. “There are two red cubes.”). In addition, if a function of the functional 
program merely restricts the output set of a preceding function, we only mention the 
outputs of the later function. For instance, if a same-color function yields a large 
and a small cube, and a subsequent filter-large function restricts the output to 
only the large cube, we do not mention the output of same-color, as the output of 
the following filter-large causes natural language redundancies².

The selection of different language templates, random sampling of synonyms and 
randomization of the object order (if possible) results in multiple different explanations. 
We uniformly sample up to 10 different explanations per question for our dataset. 
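The template filling and the aggregation of repeated descriptions might look roughly like the sketch below. It is a toy version under stated assumptions: the template strings, the numeral table, the naive pluralization, and the function names are hypothetical, and synonym sampling is left out.

```python
import random
from collections import Counter

NUMERALS = {2: "two", 3: "three", 4: "four", 5: "five"}  # hypothetical, enough for the demo

def pluralize(description):
    # Naive plural for the CLEVR shape nouns (cube, sphere, cylinder, box).
    return description + ("es" if description.endswith("x") else "s")

def describe(descriptions, rng):
    """Aggregate repeated object descriptions with numerals and shuffle their order."""
    phrases = []
    for description, n in Counter(descriptions).items():
        phrases.append(f"a {description}" if n == 1
                       else f"{NUMERALS[n]} {pluralize(description)}")
    rng.shuffle(phrases)
    return " and ".join(phrases)

def generate(descriptions, templates, rng, queried="object"):
    if not descriptions:
        # Empty tracing result: state that no relevant object exists.
        return f"There is no {queried}."
    verb = "is" if len(descriptions) == 1 else "are"
    template = rng.choice(templates)
    return template.format(verb=verb, objects=describe(descriptions, rng))

templates = ["There {verb} {objects}."]
rng = random.Random(0)
print(generate(["red cube", "red cube"], templates, rng))
print(generate(["small cube"], templates, rng))
print(generate([], templates, rng, queried="red cube"))
```

With several templates and more than one distinct object, repeated calls yield differently worded explanations for the same question, which is the source of the multiple explanations per question mentioned above.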


Dataset Split. We provide explanations for the CLEVR training and validation sets, 
skipping only a negligible subset (less than 0.04%) of questions due to malformed 
question programs from the CLEVR dataset, e.g. due to disjoint parts of their abstract 
syntax trees. In total, this affected 25 CLEVR training and 4 validation questions. 

As the scene graphs and question functional programs are not publicly available for 
the CLEVR test set, we use the original CLEVR validation subset as the CLEVR-X test 
set. 20% of the CLEVR training set serve as the CLEVR-X validation set. We perform 
this split on the image-level to avoid any overlap between images in the CLEVR-X 
training and validation sets. Furthermore, we verified that the relative proportion of 


¹ The explanation could have used the synonym “box” instead of “cube”. In contrast, “tiny”
and “small” are also synonyms in CLEVR, but the explanation would not have been consistent 
with the question which used “small”. 

² E.g. for the question: “How many large objects have the same color as the cube?”, we do not
generate the explanation “There are a small and a large cube that have the same color as the 
red cylinder of which only the large cube is large.” but instead only write “There is a large 
cube that has the same color as the red cylinder.”. 
samples from each question and answer type in the CLEVR-X training and validation
sets is similar, such that there are no biases towards specific question or answer types. 
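An image-level split of this kind can be sketched as follows. The field name `image_id`, the function name, and the toy data are assumptions made for illustration; the check of question-type and answer-type proportions is not shown.

```python
import random

def split_by_image(questions, val_fraction=0.2, seed=0):
    """Assign whole images to the validation set so that no image
    contributes questions to both training and validation."""
    image_ids = sorted({q["image_id"] for q in questions})
    rng = random.Random(seed)
    rng.shuffle(image_ids)
    n_val = round(len(image_ids) * val_fraction)
    val_images = set(image_ids[:n_val])
    train = [q for q in questions if q["image_id"] not in val_images]
    val = [q for q in questions if q["image_id"] in val_images]
    return train, val

# Toy data: 100 images with 10 questions each.
questions = [{"image_id": i, "question": f"q{j}"} for i in range(100) for j in range(10)]
train, val = split_by_image(questions)
print(len(train), len(val))
```

Splitting on image ids rather than on individual questions is what rules out overlap: all questions about one image end up on the same side of the split.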

Code for generating the CLEVR-X dataset and the dataset itself are publicly avail- 
able at https://github.com/ExplainableML/CLEVR-X. 


3.3 Dataset Analysis 


Table 1. Statistics of the CLEVR-X dataset compared to the VQA-X and e-SNLI-VE datasets.
We show the total number of images, questions, and explanations, vocabulary size, and the aver- 
age number of explanations per question, the average number of words per explanation, and the 
average number of words per question. Note that subsets do not necessarily add up to the Total 
since some subsets have overlaps (e.g. for the vocabulary). 


                    Total #                                      Average #
Dataset    Subset   Images  Questions  Explanations  Vocabulary  Explanations  Expl. Words  Quest. Words
VQA-X      Train    24,876     29,549        31,536       9,423          1.07        10.55          7.50
           Val       1,431      1,459         4,377       3,373          3.00        10.88          7.56
           Test      1,921      1,921         5,904       3,703          3.07        10.93          7.31
           Total    28,180     32,886        41,817      10,315          1.48        10.64          7.49
e-SNLI-VE  Train    29,779    401,672       401,672      36,778          1.00        13.62          8.23
           Val       1,000     14,339        14,339       8,311          1.00        14.67          8.10
           Test        998     14,712        14,712       8,334          1.00        14.59          8.20
           Total    31,777    430,723       430,723      38,208          1.00        13.69          8.23
CLEVR-X    Train    56,000    559,969     2,401,275          96          4.29        21.52         21.61
           Val      14,000    139,995       599,711          96          4.28        21.54         21.62
           Test     15,000    149,984       644,151          96          4.29        21.54         21.62
           Total    85,000    849,948     3,645,137          96          4.29        21.53         21.61


We compare the CLEVR-X dataset to the related VQA-X and e-SNLI-VE datasets in
Table 1. Similar to CLEVR-X, VQA-X contains natural language explanations for the
VQA task. However, in contrast to the natural images and human explanations in VQA-X,
CLEVR-X consists of synthetic images and explanations. The e-SNLI-VE dataset
provides explanations for the visual entailment (VE) task. VE consists of classifying an 
input image-hypothesis pair into entailment / neutral / contradiction categories. 

The CLEVR-X dataset is significantly larger than the VQA-X and e-SNLI-VE 
datasets in terms of the number of images, questions, and explanations. In contrast 
to the two other datasets, CLEVR-X provides (on average) multiple explanations for 
each question-image pair in the train set. Additionally, the average number of words 
per explanation is also higher. Since the explanations are built so that they explain each 
component mentioned in the question, long questions require longer explanations than 
short questions. Nevertheless, by design, there are no unnecessary redundancies. The 
explanation length in CLEVR-X is very strongly correlated with the length of the cor- 
responding question (Spearman’s correlation coefficient between the number of words 
in the explanations and questions is 0.89). 
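
As a minimal sketch of how such a rank correlation can be computed, the following Python snippet implements Spearman's coefficient as the Pearson correlation of average ranks. The word counts are made-up illustrative values, not CLEVR-X data, and the function names are our own.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical word counts per sample (not taken from the dataset).
question_words = [8, 12, 15, 21, 21, 30]
explanation_words = [9, 11, 18, 20, 24, 29]
rho = spearman(question_words, explanation_words)
```

A value of rho close to 1 indicates that longer questions are consistently paired with longer explanations, which is the pattern the reported coefficient of 0.89 describes.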


[Figure 3. Both panels are labeled "Average explanation length". Left panel legend (question types): zero hop, one hop, two hop, three hop, same relate, comparison, compare integer, single or, single and. Right panel legend (datasets): VQA-X, CLEVR-X, e-SNLI-VE.]

Fig. 3. Stacked histogram of the average explanation lengths measured in words for the nine 
question types for the CLEVR-X training set (left). Explanation length distribution for the CLEVR-X, 
VQA-X, and e-SNLI-VE training sets (right). The long tail of the e-SNLI-VE distribution (125 
words) was cropped out for better readability. 


Figure 3 (left) shows the explanation length distribution in the CLEVR-X dataset for 
the nine question types. The shortest explanation consists of 7 words, and the longest 
one has 53 words. On average, the explanations contain 21.53 words. In Fig. 3 (right) 
and Table 1, we can observe that explanations in CLEVR-X tend to be longer than the 
explanations in the VQA-X dataset. Furthermore, VQA-X has significantly fewer sam- 
ples overall than the CLEVR-X dataset. The e-SNLI-VE dataset also contains longer 
explanations (that are up to 125 words long), but the CLEVR-X dataset is significantly 
larger than the e-SNLI-VE dataset. However, due to the synthetic nature and limited 
domain of CLEVR, the vocabulary of CLEVR-X is very small with only 96 different 
words. Unfortunately, VQA-X and e-SNLI-VE contain spelling errors, resulting in mul- 
tiple versions of the same words. Models trained on CLEVR-X circumvent those afore- 
mentioned challenges and can purely focus on visual reasoning and explanations for the 
same. Therefore, Natural Language Generation (NLG) metrics applied to CLEVR-X 
indeed capture the factual correctness and completeness of an explanation. 


3.4 User Study on Explanation Completeness and Relevance 


In this section, we describe our user study for evaluating the completeness and relevance 
of the generated ground-truth explanations in the CLEVR-X dataset. We wanted to 
verify whether humans are successfully able to parse the synthetically generated textual 
explanations and to select complete and relevant explanations. While this is obvious for 
easier explanations like “There is a blue sphere.”, it is less trivial for more complex 
explanations such as “There are two red cylinders in front of the green cube that is to 
the right of the tiny ball.” Thus, strong human performance in the user study indicates 
that the sentences are parsable by humans. 

We performed our user study using Amazon Mechanical Turk (MTurk). It con- 
sisted of two types of Human Intelligence Tasks (HITs). Each HIT was made up of 
(1) An explanation of the task; (2) A non-trivial example, where the correct answers are 
already selected; (3) A CAPTCHA [3] to verify that the user is human; (4) The problem 
definition consisting of a question and an image; (5) A user qualification step, for which 
the user has to correctly answer a question about an image. This ensures that the user is 


[Fig. 4, left panel (completeness). Question: "Are there any large matte things to the left of the big brown matte sphere?" Prompts: "What is the correct answer for the given question and image?" and "Which of these explanations is complete?" Explanation 1: "There is a large matte ball which is on the left side of the big brown matte sphere." Explanation 2: "There are no large matte things which are to the left of the big brown matte sphere."]

[Fig. 4, right panel (relevance). Question: "There is a big purple metallic sphere; what number of brown matte spheres are to the left of it?" Prompts: "What is the correct answer for the given question and image?" and "Which explanation matches the given question and image best?" Explanation 1: "There are no brown matte spheres that are to the left of the big purple metallic sphere." Explanation 2: "There is a rubber ball which is on the right side of the large green cylinder."]

Fig. 4. Two examples from our user study to evaluate the completeness (left) and relevance (right) 
of natural language explanations in the CLEVR-X dataset. 


able to answer the question in the first place, a necessary condition to participate in our 
user study; (6) Two explanations from which the user needs to choose one. Example 
screenshots of the user interface for the user study are shown in Fig. 4. 

For the two different HIT types, we randomly sampled 100 explanations from each 
of the 9 question types, resulting in a total of 1800 samples for the completeness and 
relevance tasks. For each task sample, we requested 3 different MTurk workers based 
in the US (with high acceptance rate of > 95% and over 5000 accepted HITs). A total 
of 78 workers participated in the completeness HITs. They took on average 144.83 s 
per HIT. The relevance task was carried out by 101 workers which took on average 
120.46 s per HIT. In total, 134 people participated in our user study. In the following, 
we describe our findings regarding the completeness and relevance of the CLEVR-X 
explanations in more detail. 


Explanation Completeness. In the first part of the user study, we evaluated whether 
human users are able to determine if the ground-truth explanations in the CLEVR- 
X dataset are complete (and also correct). We presented the MTurk workers with an 
image, a question, and two explanations. As can be seen in Fig. 4 (left), a user had to 
first select the correct answer (yes) before deciding which of the two given explanations 
was complete. By design, one of the explanations presented to the user was the com- 
plete one from the CLEVR-X dataset and the other one was a modified version for 
which at least one necessary object had been removed. As simply deleting an object 
from a textual explanation could lead to grammar errors, we re-generated the explana- 
tions after removing objects from the tracing results. This resulted in incomplete, albeit 
grammatically correct, explanations. 


Table 2. Results for the user study evaluating the accuracy for the completeness and relevance 
tasks for the nine question types in the CLEVR-X dataset. 


               Zero    One    Two    Three  Same    Comparison  Compare  Single  Single  All
               hop     hop    hop    hop    relate              integer  or      and

Completeness   100.00  98.00  98.67  94.00  100.00  83.67       77.00    84.00   94.33   92.19
Relevance       99.67  99.00  95.67  89.00   95.67  87.33       83.67    90.67   92.00   92.52


To evaluate the ability to determine the completeness of explanations, we measured 
the accuracy of selecting the complete explanation. The human participants obtained 
an average accuracy of 92.19%, confirming that complete explanations which mention 
all objects necessary to answer a given question were preferred over incomplete ones. 
The performance was weaker for complex question types, such as compare-integer and 
comparison with accuracies of only 77.00% and 83.67% respectively, compared to the 
easier zero-hop and one-hop questions with accuracies of 100% and 98.00% respec- 
tively. 
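
The overall accuracies in the last column of Table 2 appear to be consistent with an unweighted mean over the nine question types, which is plausible because the same number of explanations was sampled for each type. The short check below recomputes both values from the per-type numbers in Table 2; the helper name `macro_average` is our own.

```python
# Per-question-type accuracies (in %) copied from Table 2, in column order:
# zero hop, one hop, two hop, three hop, same relate, comparison,
# compare integer, single or, single and.
completeness = [100.00, 98.00, 98.67, 94.00, 100.00, 83.67, 77.00, 84.00, 94.33]
relevance = [99.67, 99.00, 95.67, 89.00, 95.67, 87.33, 83.67, 90.67, 92.00]


def macro_average(accuracies):
    """Unweighted mean over question types, rounded to two decimals."""
    return round(sum(accuracies) / len(accuracies), 2)


print(macro_average(completeness))  # 92.19
print(macro_average(relevance))     # 92.52
```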

Additionally, there were huge variations in performance across different partici- 
pants of the completeness study (Fig. 5 (top left)), with the majority performing very 
well (>97% answering accuracy) for most question types. For the compare-integer, 
comparison and single or question types, some workers exhibited a much weaker 
performance with answering accuracies as low as 0%. The average turnaround time shown 
in Fig. 5 (bottom left) confirms that easier question types required less time to be 
solved than more complex question types, such as three hop and compare integer 
questions. Similar to the performance, the work time varied greatly between different users. 


Explanation Relevance. In the second part of our user study, we analyzed if humans 
are able to identify explanations which are relevant for a given image. For a given 
question-image pair, the users had to first select the correct answer. Furthermore, they 
were provided with a correct explanation and another randomly chosen explanation 
from the same question family (that did not match the image). The task consisted of 
selecting the correct explanation that matched the image and question content. 
Explanation 1 in the example user interface shown in Fig. 4 (right) was the relevant one, since 
Explanation 2 does not match the question and image. 

The participants of our user study were able to determine which explanation 
matched the given question-image example with an average accuracy of 92.52%. Again, 
the performance for complex question types was weaker than for easier questions. The 
difficulty of the question influences the accuracy of detecting the relevant explana- 
tion, since this task first requires understanding the question. Furthermore, complex 
questions tend to be correlated with complex scenes that contain many objects which 
makes the user’s task more challenging. The accuracy for three-hop questions was 
89.00% compared to 99.67% for zero-hop questions. For compare-integer and com- 
parison questions, the users obtained accuracies of 83.67% and 87.33% respectively, 
which is significantly lower than the overall average accuracy. 

We analyzed the answering accuracy per worker in Fig. 5 (top). The performance 
varies greatly between workers, with the majority performing very well (>90% answer- 
ing accuracy) for most question types. Some workers showed much weaker perfor- 

Fig. 5. Average answering accuracies for each worker (top) and average work time (bottom) for 
the user study (left: completeness, right: relevance). The boxes indicate the mean as well as lower 
and upper quartiles, the lines extend 1.5 interquartile ranges of the lower and upper quartile. All 
other values are plotted as diamonds. 


mance with answering accuracies as low as 0% (e.g. for compare-integer and single or 
questions). Furthermore, the distribution of work time for the relevance task is shown in 
Fig. 5 (bottom right). The turnaround times for each worker exhibit greater variation on 
the completeness task (bottom left) compared to the relevance task (bottom right). This 
might be due to the nature of the different tasks. For the completeness task, the users 
need to check if the explanation contains all the elements that are necessary to answer 
the given question. The relevance task, on the other hand, can be solved by detecting a 
single non-relevant object to discard the wrong explanation. 

Our user study confirmed that humans are able to parse the synthetically generated 
natural language explanations in the CLEVR-X dataset. Furthermore, the results have 
shown that users prefer complete and relevant explanations in our dataset over corrupted 
samples. 


4 Experiments 


We describe the experimental setup for establishing baselines on our proposed 
CLEVR-X dataset in Sect. 4.1. In Sect. 4.2, we present quantitative results on the CLEVR-X 
dataset. Additionally, we analyze the generated explanations for the CLEVR-X dataset 
in relation to the question and answer types in Sect. 4.3. Furthermore, we study the 
behavior of the NLG metrics when using different numbers of ground-truth 
explanations for testing in Sect. 4.4. Finally, we present qualitative explanation generation 
results on the CLEVR-X dataset in Sect. 4.5. 


4.1 Experimental Setup 


In this section, we provide details about the datasets and models used to establish base- 
lines for our CLEVR-X dataset and about their training details. Furthermore, we explain 
the metrics for evaluating the explanation generation performance. 


Datasets. In the following, we summarize the datasets that were used for our exper- 
iments. In addition to providing baseline results on CLEVR-X, we also report exper- 
imental results on the VQA-X and e-SNLI-VE datasets. Details about our proposed 
CLEVR-X dataset can be found in Sect. 3. The VQA-X dataset [26] is a subset of the 
VQA v2 dataset with a single human-generated textual explanation per question-image 
pair in the training set and 3 explanations for each sample in the validation and test sets. 
The e-SNLI-VE dataset [13,29] is a large-scale dataset with natural language explana- 
tions for the visual entailment task. 


Methods. We used multiple frameworks to provide baselines on our proposed CLEVR- 
X dataset. For the random words baseline, we sample random word sequences of 
length w for the answer and explanation words for each test sample. The full vocab- 
ulary corresponding to a given dataset is used as the sampling pool, and w denotes the 
average number of words forming an answer and explanation in a given dataset. For 
the random explanations baseline, we randomly sample an answer-explanation pair 
from the training set and use this as the prediction. The explanations from this base- 
line are well-formed sentences. However, the answers and explanations most likely do 
not match the question or the image. For the random-words and random-explanations 
baselines, we report the NLG metrics for all samples in the test set (instead of only 
considering the correctly answered samples, since the random sampling of the answer 
does not influence the explanation). The Pointing and Justification model PJ-X [26] 
provides text-based post-hoc justifications for the VQA task. It combines a modified 
MCB [16] framework, pre-trained on the VQA v2 dataset, with a visual pointing and 
textual justification module. The Faithful Multimodal (FM) model [53] aims at ground- 
ing parts of generated explanations in the input image to provide explanations that are 
faithful to the input image. It is based on the Up-Down VQA model [4]. In addition, 
FM contains an explanation module which enforces consistency between the predicted 
answer, explanation and the attention of the VQA model. The implementations for the 
PJ-X and FM models are based on those provided by the authors of [29]. 
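
The two random baselines described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy vocabulary and training pairs are invented, the function names are our own, and sampling words uniformly with replacement is an assumption, since the text does not specify the sampling scheme.

```python
import random


def random_words_baseline(vocabulary, w, rng):
    """Sample w words from the dataset vocabulary (assumed: uniformly, with replacement)."""
    return " ".join(rng.choice(vocabulary) for _ in range(w))


def random_explanations_baseline(train_pairs, rng):
    """Return a random (answer, explanation) pair from the training set as the prediction."""
    return rng.choice(train_pairs)


# Toy data for illustration only, not taken from CLEVR-X.
vocabulary = ["there", "is", "a", "red", "cube", "sphere", "small", "yes", "no"]
train_pairs = [
    ("yes", "there is a small red cube"),
    ("no", "there are no red spheres"),
]
rng = random.Random(0)  # fixed seed so that the sketch is reproducible

# w would be the average number of answer and explanation words in the dataset.
prediction_words = random_words_baseline(vocabulary, w=6, rng=rng)
prediction_pair = random_explanations_baseline(train_pairs, rng=rng)
```

The first baseline produces word salad of a realistic length, whereas the second always returns a well-formed sentence that is unrelated to the test question and image.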


Implementation and Training Details. We extracted 14 × 14 × 1024 grid features for 
the images in the CLEVR-X dataset using a ResNet-101 [19], pre-trained on 
ImageNet [12]. These grid features served as inputs to the FM [53] and PJ-X [26] 
frameworks. The CLEVR-X explanations are lower case and punctuation is removed from 
the sentences. We selected the best model on the CLEVR-X validation set based on the 
highest mean of the four NLG metrics, where explanations for incorrect answers were 
set to an empty string. This metric accounts for the answering performance as well as 
for the explanation quality. The final models were evaluated on the CLEVR-X test set. 
For PJ-X, our best model was trained for 52 epochs, using the Adam optimizer [33] 
with a learning rate of 0.0002 and a batch size of 256. We did not use gradient clipping 
for PJ-X. Our strongest FM model was trained for 30 epochs, using the Adam optimizer 
with a learning rate of 0.0002, a batch size of 128, and gradient clipping of 0.1. All other 
hyperparameters were taken from [26,53]. 


Evaluation Metrics. To evaluate the quality of the generated explanations, we use 
the standard natural language generation metrics BLEU [39], METEOR [9], ROUGE- 
L [36] and CIDEr [50]. By design, there is no correct explanation that can justify a 
wrong answer. We follow [29] and report the quality of the generated explanations for 
the subset of correctly answered questions. 


4.2 Evaluating Explanations Generated by State-of-the-Art Methods 


In this section, we present quantitative results for generating explanations for the 
CLEVR-X dataset (Table 3). The random words baseline exhibits weak explanation 
performance for all NLG metrics on CLEVR-X. Additionally, the random answering 
accuracy is very low at 3.6%. The results are similar on VQA-X and e-SNLI-VE. The 
random explanations baseline achieves stronger explanation results on all three datasets, 
but is still significantly worse than the trained models. This confirms that, even with a 
medium-sized answer space (28 options) and a small vocabulary (96 words), it is not 
possible to achieve good scores on our dataset using a trivial approach. 

We observed that the PJ-X model yields a significantly stronger performance on 
CLEVR-X in terms of the NLG metrics for the generated explanations compared to 
the FM model, with METEOR scores of 58.9 and 52.5 for PJ-X and FM respectively. 
Across all explanation metrics, the scores on the VQA-X and e-SNLI-VE datasets are 
in a lower range than those on CLEVR-X. For PJ-X, we obtain a CIDEr score of 639.8 
on CLEVR-X and 82.7 and 72.5 on VQA-X and e-SNLI-VE. This can be attributed to 
the smaller vocabulary and longer sentences, which allow n-gram based metrics (e.g. 
BLEU) to match parts of sentences more easily. 

In contrast to the explanation generation performance, the FM model is bet- 
ter at answering questions than PJ-X on CLEVR-X with an answering accuracy of 
80.3% for FM compared to 63.0% for PJ-X. Compared to recent models tuned to 
the CLEVR task, the answering performances of PJ-X and FM do not seem very 
strong. However, the PJ-X backbone MCB [16] (which is crucial for the answering 
performance) preceded the publication of the CLEVR dataset. A version of the MCB 
backbone (CNN+LSTM+MCB in the CLEVR publication [27]) achieved an answer- 
ing accuracy of 51.4% on CLEVR [27], whereas PJ-X is able to correctly answer 
63% of the questions. The strongest model discussed in the initial CLEVR publication 
(CNN+LSTM+SA in [27]) achieved an answering accuracy of 68.5%. 


4.3 Analyzing Results on CLEVR-X by Question and Answer Types 


In Fig. 6 (left and middle), we present the performance for PJ-X on CLEVR-X for the 
nine question and three answer types. The explanation results for samples which require 
counting abilities (counting answers) are lower than those for attribute answers (57.3 vs. 
63.3). This is in line with prior findings that VQA models struggle with counting prob- 
lems [48]. The explanation quality for binary questions is even lower with a METEOR 
score of only 55.6. The generated explanations are of higher quality for easier ques- 
tion types; zero-hop questions yield a METEOR score of 64.9 compared to 62.1 for 


Table 3. Explanation generation results on the CLEVR-X, VQA-X, and e-SNLI-VE test sets 
using BLEU-4 (B4), METEOR (M), ROUGE-L (RL), CIDEr (C), and answer accuracy (Acc). 
Higher is better for all reported metrics. For the random baselines, Acc corresponds to 100/#answers 
for CLEVR-X and e-SNLI-VE, and to the VQA answer score for VQA-X. (Rnd. words: random 
words, Rnd. expl.: random explanations) 


Model       | CLEVR-X                         | VQA-X                          | e-SNLI-VE
            | B4    M     RL    C      Acc   | B4    M     RL    C     Acc   | B4   M     RL    C     Acc
Rnd. words  |  0.0   8.4  11.4    5.9   3.6  |  0.0   1.2   0.7   0.1   0.1  | 0.0   0.3   0.0   0.0  33.3
Rnd. expl.  | 10.9  16.6  35.3   30.4   3.6  |  0.9   6.5  18.4  21.6   0.2  | 0.4   5.4   9.9   2.6  33.3
FM [53]     | 78.8  52.5  85.8  566.8  80.3  | 23.1  20.4  47.1  87.0  75.5  | 8.2  15.6  29.9  83.6  58.5
PJ-X [26]   | 87.4  58.9  93.4  639.8  63.0  | 22.7  19.7  46.0  82.7  76.4  | 7.3  14.7  28.6  72.5  69.2


three-hop questions. It can also be seen that single-or questions are harder to explain 
than single-and questions. These trends can be observed across all NLG explanation 
metrics. 


4.4 Influence of Using Different Numbers of Ground-Truth Explanations 


In this section, we study the influence of using multiple ground-truth explanations for 
evaluation on the behavior of the NLG metrics. This gives insights about whether the 
metrics can correctly rate a model’s performance with a limited number of 
ground-truth explanations. We set an upper bound k on the number of explanations used and 
randomly sample k explanations if a test sample has more than k explanations, for 
k ∈ {1, 2, ..., 10}. Figure 6 (right) shows the NLG metrics (normalized with the maximum 
value for each metric on the test set for all ground-truth explanations) for the PJ-X 
model depending on the average number of ground-truth references used on the test set. 
Out of the four metrics, BLEU-4 converges the slowest, requiring close to 3 ground- 
truth explanations to obtain a relative metric value of 95%. Hence, BLEU-4 might not 
be able to reliably predict the explanation quality on the e-SNLI-VE dataset which has 
only one explanation for each test sample. CIDEr converges faster than ROUGE and 
METEOR, and achieves 95.7% of its final value with only one ground-truth 
explanation. This could be caused by the fact that CIDEr utilizes a tf-idf weighting scheme 
for different words, which is built from all reference sentences in the subset that the 
metric is computed on. This allows CIDEr to be more sensitive to important words 
(e.g. attributes and shapes) and to give less weight, for instance, to stopwords, such as 
“the”. The VQA-X and e-SNLI-VE datasets contain much lower average numbers of 
explanations for each dataset sample (1.4 and 1.0). Since there could be many more 
possible explanations for samples in those datasets that describe different aspects than 
those mentioned in the ground truth, automated metrics may not be able to correctly 
judge a prediction even if it is correct and faithful w.r.t. the image and question. 
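
The capping procedure used in this experiment can be sketched as follows. The snippet is an illustrative assumption of how the subsampling might be implemented, with placeholder explanation strings and our own function names. It also computes the average number of references actually used, which is the quantity the metrics are plotted against.

```python
import random


def cap_references(references, k, rng):
    """Keep all references if there are at most k, else sample k without replacement."""
    if len(references) <= k:
        return list(references)
    return rng.sample(references, k)


def average_reference_count(all_references, k, rng):
    """Average number of references per test sample after capping at k."""
    capped = [cap_references(refs, k, rng) for refs in all_references]
    return sum(len(refs) for refs in capped) / len(capped)


# Toy ground-truth sets; the explanation texts are placeholders.
test_references = [
    ["expl a1", "expl a2", "expl a3", "expl a4", "expl a5"],
    ["expl b1", "expl b2"],
    ["expl c1"],
]
rng = random.Random(0)  # fixed seed so that the sketch is reproducible

# Average number of references used for each upper bound k in {1, ..., 10}.
averages = {k: average_reference_count(test_references, k, rng) for k in range(1, 11)}
```

Because many samples have fewer than k explanations, the average number of references saturates well below k for large upper bounds.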


Fig. 6. Explanation generation results for PJ-X on the CLEVR-X test set according to question 
(left) and answer (middle) types compared to the overall explanation quality. Easier types yield 
higher METEOR scores. NLG metrics using different numbers of ground-truth explanations on 
the CLEVR-X test set (right). CIDEr converges faster than the other NLG metrics. 


4.5 Qualitative Explanation Generation Results 


We show examples for explanations generated with the PJ-X framework on CLEVR-X 
in Fig. 7. As can be seen across the three examples presented, PJ-X generates 
high-quality explanations which closely match the ground-truth explanations. 

In the left-most example in Fig. 7, we can observe slight variations in grammar when 
comparing the generated explanation to the ground-truth explanation. However, the con- 
tent of the generated explanation corresponds to the ground truth. Furthermore, some 
predicted explanations differ from the ground-truth explanation in the use of another 
synonym for a predicted attribute. For instance, in the middle example in Fig. 7, the 
ground-truth explanation describes the size of the cylinder as “small”, whereas the pre- 
dicted explanation uses the equivalent attribute “tiny”. In contrast to other datasets, the 
set of ground-truth explanations for each sample in CLEVR-X contains these variations. 
Therefore, the automated NLG metrics do not decrease when such variations are found 
in the predictions. For the first and second example, PJ-X obtains the highest possible 
explanation score (100.0) in terms of the BLEU-4, METEOR, and ROUGE-L metrics. 

We show a failure case where PJ-X predicted the wrong answer in Fig. 7 (right). The 
generated answer-explanation pair shows that the predicted explanation is consistent 
with the wrong answer prediction and does not match the input question-image pair. 
The NLG metrics for this case are significantly weaker with a BLEU-4 score of 0.0, as 
there are no matching 4-grams between the prediction and the ground truth. 
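
The missing 4-gram overlap in this failure case can be verified directly. The sketch below counts clipped n-gram matches between the predicted and the ground-truth explanation from Fig. 7 (right), lower-cased and without punctuation. It only uses the single reference shown in the figure and is not a full BLEU implementation; the helper names are our own.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(prediction, reference, n):
    """Number of predicted n-grams that also occur in the reference (clipped counts)."""
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    return sum((pred & ref).values())


# Failure case from Fig. 7 (right).
prediction = "there is a small matte cube"
reference = "there are no small matte cubes"

matches_4 = clipped_matches(prediction, reference, 4)  # no shared 4-gram
matches_2 = clipped_matches(prediction, reference, 2)  # only ("small", "matte")
```

Since BLEU-4 combines the 1-gram to 4-gram precisions through a geometric mean, a 4-gram precision of zero yields an unsmoothed BLEU-4 score of zero, even though shorter n-grams still match.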


[Fig. 7, left example. Question: "How many tiny red things are the same material as the big sphere?" GT answer | explanation: 1 | "The tiny red metal block has the same material as a big sphere." Predicted answer | explanation: 1 | "There is the tiny red metal block which has the identical material as a big sphere." B4/M/RL/C: 100.0 / 100.0 / 100.0 / 744.0]

[Fig. 7, middle example. Question: "The cylinder has what size?" GT answer | explanation: Small | "The cylinder is small." Predicted answer | explanation: Small | "The cylinder is tiny." B4/M/RL/C: 100.0 / 100.0 / 100.0 / 462.4]

[Fig. 7, right example. Question: "Are there any small matte cubes?" GT answer | explanation: No | "There are no small matte cubes." Predicted answer | explanation: Yes | "There is a small matte cube." B4/M/RL/C: 0.0 / 76.9 / 57.1 / 157.1]

Fig. 7. Examples for answers and explanations generated with the PJ-X framework on the 
CLEVR-X dataset, showing correct answer predictions (left, middle) and a failure case (right). 
The NLG metrics obtained with the explanations for the correctly predicted answers are high 
compared to those for the explanation corresponding to the wrong answer prediction. 


5 Conclusion 


We introduced the novel CLEVR-X dataset which contains natural language explana- 
tions for the VQA task on the CLEVR dataset. Our user study confirms that the expla- 
nations in the CLEVR-X dataset are complete and match the questions and images. 
Furthermore, we have provided baseline performances using the PJ-X and FM frame- 
works on the CLEVR-X dataset. The structured nature of our proposed dataset allowed 
the detailed evaluation of the explanation generation quality according to answer and 
question types. We observed that the generated explanations were of higher quality 
for easier answer and question categories. One of our findings is that explanations 
for counting problems are worse than those for attribute answers, suggesting that further 
research in this direction is needed. Additionally, we find that the four NLG metrics 
used to evaluate the quality of the generated explanations exhibit different convergence 
patterns depending on the number of available ground-truth references. 

Since this work only considered two natural language generation methods for VQA 
as baselines, the natural next step will be the benchmarking and closer investigation 
of additional recent frameworks for textual explanations in the context of VQA on the 
CLEVR-X dataset. We hope that our proposed CLEVR-X benchmark will facilitate fur- 
ther research to improve the generation of natural language explanations in the context 
of vision-language tasks. 


Abstract. We present the Rate-Distortion Explanation (RDE) framework, 
a mathematically well-founded method for explaining black-box 
model decisions. The framework is based on perturbations of the target 
input signal and applies to any differentiable pre-trained model such as 
neural networks. Our experiments demonstrate the framework’s adapt- 
ability to diverse data modalities, particularly images, audio, and phys- 
ical simulations of urban environments. 


1 Introduction 


Powerful machine learning models such as deep neural networks are inherently 
opaque, which has motivated numerous explanation methods that the research 
community developed over the last decade [1,2,7,15,16,20,26,29]. The meaning 
and validity of an explanation depend on the underlying principle of the 
explanation framework. Therefore, a trustworthy explanation framework must 
align intuition with mathematical rigor while maintaining maximal flexibility 
and applicability. We believe the Rate-Distortion Explanation (RDE) frame- 
work, first proposed by [16], then extended by [9], as well as the similar frame- 
work in [2], meets the desired qualities. In this chapter, we aim to present 
the RDE framework in a revised and holistic manner. Our generalized RDE 
framework can be applied to any model (not just classification tasks), supports 
in-distribution interpretability (by leveraging in-painting GANs), and admits 
interpretation queries (by considering suitable input signal representations). 

The typical setting of a (local) explanation method is given by a pre-trained 
model $\Phi : \mathbb{R}^n \to \mathbb{R}^m$ and a data instance $x \in \mathbb{R}^n$. The model $\Phi$ can solve 
either a classification task with $m$ class labels or a regression task with $m$-dimensional 
model output. The model decision $\Phi(x)$ is to be explained. In the original RDE 
framework [16], an explanation for $\Phi(x)$ is a set of feature components 
$S \subset \{1,\dots,n\}$ in $x$ that are deemed relevant for the decision $\Phi(x)$. The core principle 
behind the RDE framework is that a set $S \subset \{1,\dots,n\}$ contains all the relevant 
components if $\Phi(x)$ remains (approximately) unchanged after modifying $x_{S^c}$, i.e., 
the components in $x$ that are not deemed relevant. In other words, $S$ contains all 
relevant features if they are sufficient for producing the output $\Phi(x)$. To convey 
concise explanatory information, one aims to find the minimal set $S \subset \{1,\dots,n\}$ 
with all the relevant components. As demonstrated in [16] and [31], the minimal 
relevant set $S \subset \{1,\dots,n\}$ cannot be found combinatorially in an efficient 
manner for large input sizes. A meaningful approximation can nevertheless be 
found by optimizing a sparse continuous mask $s \in [0,1]^n$ that has no significant 
effect on the output $\Phi(x)$ in the sense that $\Phi(x) \approx \Phi(x \odot s + (1-s) \odot v)$ should 
hold for appropriate perturbations $v \in \mathbb{R}^n$, where $\odot$ denotes the componentwise 
multiplication. Suppose $d(\Phi(x), \Phi(y))$ is a measure of distortion (e.g. the 
$\ell_2$-norm) between the model outputs for $x, y \in \mathbb{R}^n$ and $\mathcal{V}$ is a distribution over 
appropriate perturbations $v \sim \mathcal{V}$. An explanation in the RDE framework can be 
found as a solution mask $s^*$ to the following minimization problem: 


$$ s^* := \operatorname*{arg\,min}_{s \in [0,1]^n} \; \mathbb{E}_{v \sim \mathcal{V}} \Big[ d\big(\Phi(x),\, \Phi(x \odot s + (1-s) \odot v)\big) \Big] + \lambda \|s\|_1, $$


where $\lambda > 0$ is a hyperparameter controlling the sparsity of the mask. 
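
To make the objective above concrete, the following toy sketch minimizes it with projected gradient descent. Everything specific in it is an assumption chosen for illustration: a linear model with scalar output, the squared difference as distortion, and i.i.d. zero-mean Gaussian perturbations with variance sigma^2. Under these assumptions the expected distortion has the closed form (sum_i w_i (1 - s_i) x_i)^2 + sigma^2 sum_i w_i^2 (1 - s_i)^2, so no sampling is needed. For general models the expectation would have to be estimated, for instance by sampling perturbations.

```python
def expected_distortion_grad(s, w, x, sigma):
    """Gradient w.r.t. s of E_v[(Phi(x) - Phi(x*s + (1-s)*v))^2] for the toy
    linear model Phi(z) = sum_i w_i z_i with v_i ~ N(0, sigma^2) i.i.d."""
    # Mean of the output difference: sum_j w_j (1 - s_j) x_j.
    mean_diff = sum(wj * (1 - sj) * xj for wj, sj, xj in zip(w, s, x))
    return [
        -2 * wi * xi * mean_diff - 2 * sigma ** 2 * wi ** 2 * (1 - si)
        for wi, si, xi in zip(w, s, x)
    ]


def rde_mask(w, x, sigma=1.0, lam=0.5, step=0.01, iters=2000):
    """Projected gradient descent on expected distortion + lam * ||s||_1."""
    s = [1.0] * len(x)  # start from the unperturbed signal
    for _ in range(iters):
        grad = expected_distortion_grad(s, w, x, sigma)
        # On [0, 1]^n the l1-norm equals sum(s), so its gradient is lam per entry.
        # Clipping to [0, 1] is the projection onto the feasible set.
        s = [min(1.0, max(0.0, si - step * (gi + lam))) for si, gi in zip(s, grad)]
    return s


# Toy instance: features 0 and 2 carry large weights, features 1 and 3 do not.
weights = [3.0, 0.0, -2.0, 0.1]
signal = [1.0, 5.0, 1.0, 1.0]
mask = rde_mask(weights, signal)
```

In this toy instance the mask stays close to 1 on the two features with large weights and is driven to 0 on the other two, so the sparse mask retains exactly the components whose perturbation would change the output noticeably.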

We further generalize the RDE framework to abstract input signal repre- 
sentations x = f(h), where f is a data representation function with input h. 
The philosophy of the generalized RDE framework is that an explanation for 
generic input signals x = f(h) should be some simplified version of the signal, 
which is interpretable to humans. This is achieved by demanding sparsity in a 
suitable representation system h, which ideally optimally represents the class 
of explanations that are desirable for the underlying domain and interpretation 
query. This philosophy underpins our experiments on image classification in the 
wavelet domain, on audio signal classification in the Fourier domain, and on radio 
map estimation in an urban environment domain. Therein we demonstrate the 
versatility of our generalized RDE framework. 


2 Related Works 


To our knowledge, the explanation principle of optimizing a mask $s \in [0,1]^n$ was 
first proposed in [7]. Fong et al. [7] explained image classification decisions 
by considering one of the two “deletion games”: (1) optimizing for the smallest 
deletion mask that causes the class score to drop significantly or (2) optimizing 
for the largest deletion mask that has no significant effect on the class score. 
The original RDE approach [16] is based on the second deletion game and 
connects the deletion principle to rate-distortion theory, which studies lossy data 
compression. Deleted entries in [7] were replaced with either constants, noise, or 
blurring, and deleted entries in [16] were replaced with noise. 

Explanation methods introduced before the “deletion games” principle from
[7] were typically based on gradients [26,29], propagation of
activations in neurons [1,25], surrogate models [20], and game theory [15].
Gradient-based methods such as SmoothGrad [26] lack a principled notion
of relevance beyond local sensitivity. Reference-based methods such as Integrated
Gradients [29] and DeepLIFT [25] depend on a reference value, which
has no clear optimal choice. DeepLIFT and LRP assign relevance by propagating
neuron activations, which makes them dependent on the implementation of
$\Phi$. LIME [20] uses an interpretable surrogate model that approximates $\Phi$ in a
neighborhood around $x$. Surrogate model explanations are inherently limited
for complex models $\Phi$ (such as image classifiers) as they only admit very local
approximations. Generally, explanations that only depend on the model behavior
on a small neighborhood $U_x$ of $x$ offer limited insight. Lastly, Shapley value-based
explanations [15] are grounded in Shapley values from game theory. They
assign relevance scores as weighted averages of marginal contributions of the
respective features. Though Shapley values are mathematically well-founded, relevance
scores cannot be computed exactly for common input sizes such as $n > 50$, since
one exact relevance score generally requires $O(2^n)$ evaluations of $\Phi$ [30].
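To make the exponential cost concrete, the following is a minimal sketch of exact Shapley value computation for a small set function. The function `exact_shapley` and the additive toy game are hypothetical illustrations, not part of any cited method; the point is only that all $2^n$ subsets end up being evaluated.

```python
from itertools import combinations
from math import factorial

def exact_shapley(value, n):
    """Exact Shapley values of a set function `value` on players 0..n-1."""
    cache = {}

    def v(subset):
        key = frozenset(subset)
        if key not in cache:          # each distinct subset is evaluated once
            cache[key] = value(key)
        return cache[key]

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for subsets S of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (v(subset + (i,)) - v(subset))
    return phi, len(cache)

# Hypothetical additive game: the value of a coalition is the sum of its weights.
weights = [1.0, 2.0, 3.0, 4.0]
phi, n_evals = exact_shapley(lambda S: sum(weights[j] for j in S), len(weights))
```

For this additive game each Shapley value equals the player's own weight, and `n_evals` equals $2^4 = 16$; the same count for $n = 50$ would exceed $10^{15}$ model evaluations.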

A notable difference between the RDE method and additive feature explana- 
tions [15] is that the values in the mask s* do not add up to the model output. The 
additive property as in [15] takes the view that features individually contribute 
to the model output and relevance should be reflected by their contributions. 
We emphasize that the RDE method is designed to look for a set of relevant 
features and not an estimate of individual relative contributions. This is partic- 
ularly desirable when only groups of features are interpretable, as for example in 
image classification tasks, where individual pixels do not carry any interpretable 
meaning. Similarly to Shapley values, the explanation in the RDE framework 
cannot be computed exactly, as it requires solving a non-convex minimization 
problem. However, the RDE method can take full advantage of modern optimiza- 
tion techniques. Furthermore, the RDE method is a model-agnostic explanation 
technique, with a mathematically principled and intuitive notion of relevance as 
well as enough flexibility to incorporate the model behavior on meaningful input 
regions of $\Phi$.

The meaning of an explanation based on deletion masks $s \in [0,1]^n$ depends on
the nature of the perturbations that replace the deleted regions. Random [7,16]
or blurred [7] replacements $v \in \mathbb{R}^n$ may result in a data point $x \odot s + (1-s) \odot v$
that falls off the natural data manifold on which $\Phi$ was trained. This is
a subtle though important problem, since such an explanation may depend on
evaluations of $\Phi$ on data points from undeveloped decision regions. The latter
motivates in-distribution interpretability, which considers meaningful perturbations
that keep $x \odot s + (1-s) \odot v$ in the data manifold. The work [2] was the first
to suggest using an inpainting GAN to generate meaningful perturbations
for the “deletion games”. The authors of [9] then applied in-distribution
interpretability to the RDE method in the challenging modalities of music and physical
simulations of urban environments. Moreover, they demonstrated that the RDE
method in [16] can be extended to answer so-called “interpretation queries”.
For example, the RDE method was applied in [9] to an instrument classifier to
answer the global interpretation query “Is magnitude or phase in the signal more
important for the classifier?”. Most recently, in [11], we introduced CartoonX
as a novel explanation method for image classifiers, answering the interpretation
query “What is the relevant piece-wise smooth part of an image?” by applying
RDE in the wavelet basis of images.


3 Rate-Distortion Explanation Framework 


Based on the original RDE approach from [16], in this section, we present a 
general formulation of the RDE framework and discuss several implementations. 
While [16] focuses merely on image classification with explanations in pixel rep- 
resentation, we will apply the RDE framework not only to more challenging 
domains but also to different input signal representations. Not surprisingly, the 
combinatorial optimization problem in the RDE framework, even in simpler
form, is extremely hard to solve [16,31]. This motivates heuristic solution strategies,
which will be discussed in Subsect. 3.2.


3.1 General Formulation 


It is well-known that in practice there are different ways to describe a signal
$x \in \mathbb{R}^n$. Generally speaking, $x$ can be represented by a data representation
function $f : \prod_{i=1}^{k} \mathbb{R}^{d_i} \to \mathbb{R}^n$,

$$x = f(h_1, \ldots, h_k), \qquad (1)$$

for some inputs $h_i \in \mathbb{R}^{d_i}$, $d_i \in \mathbb{N}$, $i \in \{1, \ldots, k\}$, $k \in \mathbb{N}$. Note that we do not
restrict ourselves to linear data representation functions $f$. To briefly illustrate
the generality of this abstract representation, we consider the following examples.


Example 1 (Pixel representation). An arbitrary (vectorized) image $x \in \mathbb{R}^n$ can
be simply represented pixelwise

$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = f(h_1, \ldots, h_n),$$

with $h_i = x_i$ being the individual pixel values and $f : \mathbb{R}^n \to \mathbb{R}^n$ being the
identity transform.


Due to its simplicity, this standard basis representation is a reasonable choice 
when explaining image classification models. However, in many other applica- 
tions, one requires more sophisticated representations of the signals, such as 
through a possibly redundant dictionary. 


Example 2. Let $\{\psi_j\}_{j=1}^{k}$, $k \in \mathbb{N}$, be a dictionary in $\mathbb{R}^n$, e.g., a basis. A signal
$x \in \mathbb{R}^n$ is represented as

$$x = \sum_{j=1}^{k} h_j \psi_j,$$

where $h_j \in \mathbb{R}$, $j \in \{1, \ldots, k\}$, are appropriate coefficients. In terms of the
abstract representation (1), we have $d_j = 1$ for $j \in \{1, \ldots, k\}$ and $f$ is the
function that yields the weighted sum over $\psi_j$. Note that Example 1 can be seen
as a special case of this representation.
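As a small illustration of Examples 1 and 2, the sketch below synthesizes a signal from coefficients and a dictionary. The helper `synthesize` and the two toy dictionaries are hypothetical; they only show that the standard basis recovers the pixel representation and that a redundant dictionary can express the same signal with different coefficients.

```python
def synthesize(h, dictionary):
    """Representation function f: x = sum_j h[j] * psi_j for atoms psi_j in R^n."""
    n = len(dictionary[0])
    x = [0.0] * n
    for coeff, atom in zip(h, dictionary):
        for i in range(n):
            x[i] += coeff * atom[i]
    return x

# Standard basis of R^2: the coefficients are the pixel values themselves (Example 1).
standard_basis = [[1.0, 0.0], [0.0, 1.0]]
# Redundant dictionary with k = 3 atoms in R^2 (Example 2).
redundant = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

x_pixel = synthesize([2.0, 3.0], standard_basis)
x_redundant = synthesize([0.0, 1.0, 2.0], redundant)  # same signal, other coefficients
```

Both calls return the signal `[2.0, 3.0]`; with the redundant dictionary the coefficient vector is not unique, which is why sparsity in $h$ depends on the chosen representation system.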


The following gives an example of a non-linear representation function f. 


Example 3. Consider the discrete inverse Fourier transform, defined as

$$f : \prod_{j=1}^{n} \mathbb{R}_+ \times \prod_{j=1}^{n} [0, 2\pi] \to \mathbb{C}^n,$$

$$[f(m_1, \ldots, m_n, \omega_1, \ldots, \omega_n)]_l := \frac{1}{n} \sum_{j=1}^{n} \underbrace{m_j e^{i\omega_j}}_{=:\, c_j \in \mathbb{C}} e^{i 2\pi l (j-1)/n}, \quad l \in \{1, \ldots, n\},$$

where $m_j$ and $\omega_j$ are respectively the magnitude and the phase of the $j$-th
discrete Fourier coefficient $c_j$. Thus every signal $x \in \mathbb{R}^n \subset \mathbb{C}^n$ can be represented
in terms of (1) with $f$ being the discrete inverse Fourier transform and $h_j$,
$j = 1, \ldots, k$ (with $k = 2n$), specified as $m_{j'}$ and $\omega_{j'}$, $j' = 1, \ldots, n$.

Further examples of dictionaries $\{\psi_j\}_{j=1}^{k}$ include the discrete wavelet [21], cosine
[19] or shearlet [12] representation systems and many more. In these cases, the
coefficients $h_j$ are given by the forward transform and $f$ is referred to as the
backward transform. Note that in the above examples we have $d_i = 1$, i.e., the
input vectors $h_i$ are real-valued. In many situations, one is also interested in
representations $x = f(h_1, \ldots, h_k)$ with $h_i \in \mathbb{R}^{d_i}$ where $d_i > 1$.


Example 4. Let $k = 2$ and define $f$ again as the discrete inverse Fourier transform,
but as a function of two components: (1) the entire magnitude spectrum
and (2) the entire phase spectrum, namely

$$f : \mathbb{R}_+^n \times [0, 2\pi]^n \to \mathbb{C}^n,$$

$$[f(m, \omega)]_l := \frac{1}{n} \sum_{j=1}^{n} m_j e^{i\omega_j} e^{i 2\pi l (j-1)/n}, \quad l \in \{1, \ldots, n\}.$$

Similarly, instead of individual pixel values, one can consider patches of pixels
in an image $x \in \mathbb{R}^n$ from Example 1 as the input vectors $h_i$ to the identity
transform $f$. We will come back to these examples in the experiments in Sect. 4.
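The magnitude and phase representations of Examples 3 and 4 can be sketched as follows. This is an illustrative sketch only: it uses the standard 0-based DFT indexing (which differs from the indexing in the formulas above by an index shift), and the helper names `dft`, `inverse_dft`, and `obfuscate_spectra` are hypothetical.

```python
import cmath
import math

def dft(x):
    """Forward discrete Fourier transform (0-based indexing)."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * math.pi * j * l / n) for l in range(n))
            for j in range(n)]

def inverse_dft(magnitudes, phases):
    """f(m, w): rebuild the signal from magnitudes m_j and phases w_j."""
    n = len(magnitudes)
    coeffs = [m * cmath.exp(1j * w) for m, w in zip(magnitudes, phases)]
    return [sum(coeffs[j] * cmath.exp(2j * math.pi * j * l / n) for j in range(n)) / n
            for l in range(n)]

def obfuscate_spectra(m, w, s, v_m, v_w):
    """Example 4 with k = 2: one mask entry per spectrum, s = (s_mag, s_phase)."""
    s_mag, s_phase = s
    m_obf = [s_mag * a + (1 - s_mag) * b for a, b in zip(m, v_m)]
    w_obf = [s_phase * a + (1 - s_phase) * b for a, b in zip(w, v_w)]
    return inverse_dft(m_obf, w_obf)

x = [1.0, 2.0, 0.5, -1.0]
c = dft(x)
m = [abs(cj) for cj in c]                                # magnitudes m_j
w = [cmath.phase(cj) % (2 * math.pi) for cj in c]        # phases in [0, 2*pi)
x_rec = inverse_dft(m, w)
x_keep_all = obfuscate_spectra(m, w, (1.0, 1.0), [0.0] * 4, [0.0] * 4)
x_no_magnitude = obfuscate_spectra(m, w, (0.0, 1.0), [0.0] * 4, [0.0] * 4)
```

Keeping both spectra reproduces the signal up to floating-point error, whereas replacing the whole magnitude spectrum by a zero perturbation removes the signal entirely. The grouped mask has only two entries, which is what makes the magnitude-versus-phase query tractable.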


Finally, we would like to remark that our abstract representation

$$x = f(h_1, \ldots, h_k)$$

also covers the cases where the signal is the output of a decoder or generative
model $f$ with inputs $h_1, \ldots, h_k$ as the code or the latent variables.

As was discussed in previous sections, the main idea of the RDE framework 
is to extract the relevant features of the signal based on the optimization over its 
perturbations defined through masks. The ingredients of this idea are formally 
defined below. 
Definition 1 (Obfuscations and expected distortion). Let $\Phi : \mathbb{R}^n \to \mathbb{R}^m$
be a model and $x \in \mathbb{R}^n$ a data point with a data representation $x = f(h_1, \ldots, h_k)$
as discussed above. For every mask $s \in [0,1]^k$, let $\mathcal{V}_s$ be a probability distribution
over $\prod_{i=1}^{k} \mathbb{R}^{d_i}$. Then the obfuscation of $x$ with respect to $s$ and $\mathcal{V}_s$ is defined as
the random vector

$$y := f(s \odot h + (1-s) \odot v),$$

where $v \sim \mathcal{V}_s$, $(s \odot h)_i = s_i h_i \in \mathbb{R}^{d_i}$ and $((1-s) \odot v)_i = (1-s_i) v_i \in \mathbb{R}^{d_i}$
for $i \in \{1, \ldots, k\}$. Furthermore, the expected distortion of $x$ with respect to the
mask $s$ and the perturbation distribution $\mathcal{V}_s$ is defined as

$$D(x, s, \mathcal{V}_s, \Phi) := \mathbb{E}_{v \sim \mathcal{V}_s} \Big[ d\big(\Phi(x), \Phi(y)\big) \Big],$$

where $d : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}_+$ is a measure of distortion between two model outputs.
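The two objects in Definition 1 can be sketched in a few lines. The code below is a minimal illustration under simplifying assumptions: scalar inputs ($d_i = 1$), an identity representation function, a toy linear model, and standard Gaussian perturbations. All names (`obfuscate`, `expected_distortion`, `phi`, `gaussian`) are hypothetical.

```python
import random

def obfuscate(f, h, s, v):
    """y = f(s * h + (1 - s) * v), componentwise, for scalar inputs h_i (d_i = 1)."""
    return f([si * hi + (1.0 - si) * vi for si, hi, vi in zip(s, h, v)])

def expected_distortion(phi, f, h, s, sample_v, d, n_samples=64):
    """Monte-Carlo estimate of D(x, s, V_s, Phi) using n_samples draws v ~ V_s."""
    out_x = phi(f(h))
    total = 0.0
    for _ in range(n_samples):
        total += d(out_x, phi(obfuscate(f, h, s, sample_v(s))))
    return total / n_samples

# Toy ingredients (all hypothetical).
rng = random.Random(0)
identity = lambda h: list(h)                        # pixel-like representation
phi = lambda x: [3.0 * x[0] + 2.0 * x[3]]           # model uses entries 0 and 3 only
sq_l2 = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
gaussian = lambda s: [rng.gauss(0.0, 1.0) for _ in s]

h = [1.0, 1.0, 1.0, 1.0]
d_keep_all = expected_distortion(phi, identity, h, [1, 1, 1, 1], gaussian, sq_l2)
d_keep_relevant = expected_distortion(phi, identity, h, [1, 0, 0, 1], gaussian, sq_l2)
d_keep_none = expected_distortion(phi, identity, h, [0, 0, 0, 0], gaussian, sq_l2)
```

Keeping only the two entries the toy model depends on already gives zero distortion, while the empty mask does not; this is the sense in which a sparse mask with low expected distortion identifies the relevant components.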


In the RDE framework, the explanation is given by a mask that minimizes dis- 
tortion while remaining relatively sparse. The rate-distortion-explanation mask 
is defined in the following. 


Definition 2 (The RDE mask). In the setting of Definition 1 we define the
RDE mask as a solution $s^*(\ell)$ to the minimization problem

$$\min_{s \in \{0,1\}^k} D(x, s, \mathcal{V}_s, \Phi) \quad \text{s.t.} \quad \|s\|_0 \le \ell, \qquad (2)$$

where $\ell \in \{1, \ldots, k\}$ is the desired level of sparsity.
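For very small $k$, problem (2) can be solved by exhaustive search, which also makes its combinatorial nature visible. The sketch below is illustrative only: `brute_force_rde_mask` is a hypothetical helper, and `toy_distortion` is an assumed closed-form expected distortion for a linear model with weights `w`, input $x = (1, \ldots, 1)$ and standard Gaussian perturbations.

```python
from itertools import combinations

def brute_force_rde_mask(distortion, k, ell):
    """Exhaustively minimise distortion(s) over binary masks with at most ell ones."""
    best_mask, best_value = None, float("inf")
    for size in range(ell + 1):
        for support in combinations(range(k), size):
            mask = [1 if i in support else 0 for i in range(k)]
            value = distortion(mask)
            if value < best_value:
                best_mask, best_value = mask, value
    return best_mask, best_value

# Assumed toy setting: Phi(x) = w . x, x = ones, v ~ N(0, I), squared-error distortion.
w = [3.0, 0.0, 0.0, 2.0]

def toy_distortion(s):
    a = [wi * (1 - si) for wi, si in zip(w, s)]
    return sum(a) ** 2 + sum(ai ** 2 for ai in a)

mask_1, value_1 = brute_force_rde_mask(toy_distortion, k=4, ell=1)
mask_2, value_2 = brute_force_rde_mask(toy_distortion, k=4, ell=2)
```

With $\ell = 1$ the search keeps the entry with the largest weight, and with $\ell = 2$ it keeps both entries the model uses. The number of candidate masks grows combinatorially in $k$, so this approach is not viable beyond toy sizes.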


Here, the RDE mask is defined as the binary mask that minimizes the expected
distortion while keeping the sparsity smaller than a certain threshold. Alternatively,
one could define the RDE mask as the sparsest binary mask
that keeps the distortion lower than a given threshold, as defined in [16]. Geometrically,
one can interpret the RDE mask as a subspace that is stable under
$\Phi$. If $x = f(h)$ is the input signal and $s$ is the RDE mask for $\Phi(x)$ on the coefficients
$h$, then the associated subspace $R_\Phi(s)$ is defined as the space of feasible
obfuscations of $x$ with $s$ under $\mathcal{V}_s$, i.e.,

$$R_\Phi(s) := \{ f(s \odot h + (1-s) \odot v) \mid v \in \operatorname{supp} \mathcal{V}_s \},$$

where $\operatorname{supp} \mathcal{V}_s$ denotes the support of the distribution $\mathcal{V}_s$. The model $\Phi$ will act
similarly on signals in $R_\Phi(s)$ due to the low expected distortion $D(x, s, \mathcal{V}_s, \Phi)$,
making the subspace stable under $\Phi$. Note that RDE directly optimizes towards
a subspace that is stable under $\Phi$. If, instead, one were to choose the mask $s$
based on information from the gradient $\nabla\Phi(x)$ and Hessian $\nabla^2\Phi(x)$, then only
a local neighborhood around $x$ would tend to be stable under $\Phi$ due to the
local nature of the gradient and Hessian. Before discussing practical algorithms
to approximate the RDE mask in Subsect. 3.2, we will review frequently used
obfuscation strategies, i.e., the distribution $\mathcal{V}_s$, and measures of distortion.
3.1.1 Obfuscation Strategies and in-Distribution Interpretability

The meaning of an explanation in RDE depends greatly on the nature of the
perturbations $v \sim \mathcal{V}_s$. A particular choice of $\mathcal{V}_s$ defines an obfuscation strategy.
Obfuscations are in-distribution if the obfuscation $f(s \odot h + (1-s) \odot v)$
lies on the natural data manifold that $\Phi$ was trained on, and out-of-distribution
otherwise. Out-of-distribution obfuscations pose the following problem. The
RDE mask (see Definition 2) depends on evaluations of $\Phi$ on obfuscations
$f(s \odot h + (1-s) \odot v)$. If $f(s \odot h + (1-s) \odot v)$ is not on the natural data
manifold that $\Phi$ was trained on, then it may lie in undeveloped regions of $\Phi$.
In practice, we are interested in explaining the behavior of $\Phi$ on realistic data
and an explanation can be corrupted if $\Phi$ did not develop the region of
out-of-distribution points $f(s \odot h + (1-s) \odot v)$. One can guard against this by choosing
$\mathcal{V}_s$ so that $f(s \odot h + (1-s) \odot v)$ is in-distribution. Choosing $\mathcal{V}_s$ in-distribution
boils down to modeling the conditional data distribution, a non-trivial task.


Example 5 (In-distribution obfuscation strategy). In light of the recent success
of generative adversarial networks (GANs) in generative modeling [8], one can
train an in-painting GAN [32]

$$G(h, s, z) \in \prod_{i=1}^{k} \mathbb{R}^{d_i},$$

where $z$ are random latent variables of the GAN, such that the obfuscation
$f(s \odot h + (1-s) \odot G(h, s, z))$ lies on the natural data manifold (see also [2]). In
other words, one can choose $\mathcal{V}_s$ as the distribution of $v := G(h, s, z)$, where the
randomness comes from the random latent variables $z$.


Example 6 (Out-of-distribution obfuscation strategies). A very simple obfuscation
strategy is Gaussian noise. In that case, one defines $\mathcal{V}_s$ for every $s \in [0,1]^k$ as
$\mathcal{V}_s := \mathcal{N}(\mu, \Sigma)$, where $\mu$ and $\Sigma$ denote a pre-defined mean vector and covariance
matrix. In Sect. 4.1, we give an example of a reasonable choice for $\mu$ and $\Sigma$ for
image data. Alternatively, for images with pixel representation (see Example 1)
one can replace the deleted pixels with blurred inputs, $v = K * x$, where $K$ is a
suitable blur kernel.


Table 1. Common obfuscation strategies with their perturbation formulas.


Obfuscation strategy | Perturbation formula               | In-distribution
Constant             | $v \in \mathbb{R}^d$               | ✗
Noise                | $v \sim \mathcal{N}(\mu, \Sigma)$  | ✗
Blurring             | $v = K * x$                        | ✗
Inpainting-GAN       | $v = G(h, s, z)$                   | ✓


We summarize common obfuscation strategies for a given target signal in 
Table 1. 
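The three out-of-distribution strategies from Table 1 can be sketched for a 1D signal as follows. This is an illustration under stated assumptions: the helper names are hypothetical, the blur uses a symmetric kernel with edge replication, and the inpainting GAN is omitted because it requires a trained generator.

```python
import random

def constant_perturbation(x, value=0.0):
    """Constant strategy: every deleted entry is replaced by the same value."""
    return [value] * len(x)

def noise_perturbation(x, mu, sigma, rng):
    """Noise strategy: v ~ N(mu, sigma^2 I), independent of the mask."""
    return [rng.gauss(mu, sigma) for _ in x]

def blur_perturbation(x, kernel):
    """Blurring strategy: v = K * x for a symmetric odd-length kernel K."""
    r = len(kernel) // 2
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for j, kj in enumerate(kernel):
            idx = min(max(i + j - r, 0), n - 1)   # replicate the signal at the edges
            acc += kj * x[idx]
        out.append(acc)
    return out

def obfuscate(x, s, v):
    """Pixel-like obfuscation s * x + (1 - s) * v."""
    return [si * xi + (1.0 - si) * vi for si, xi, vi in zip(s, x, v)]

x = [0.0, 0.0, 4.0, 0.0, 0.0]
s = [1.0, 1.0, 0.0, 1.0, 1.0]                      # delete the central entry
v_blur = blur_perturbation(x, [0.25, 0.5, 0.25])
y_blur = obfuscate(x, s, v_blur)
y_const = obfuscate(x, s, constant_perturbation(x))
y_noise = obfuscate(x, s, noise_perturbation(x, 0.0, 1.0, random.Random(0)))
```

In all three cases the kept entries are untouched and only the deleted entry changes; the strategies differ solely in what fills the gap, which is exactly what determines whether the obfuscation stays close to realistic data.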
3.1.2 Measure of Distortion

Various options exist for the measure $d : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}_+$ of the distortion between
model outputs. The measure of distortion should be chosen according to the task
of the model $\Phi : \mathbb{R}^n \to \mathbb{R}^m$ and the objective of the explanation.


Example 7 (Measure of distortion for classification task). Consider a classification
model $\Phi : \mathbb{R}^n \to \mathbb{R}^m$ and a target input signal $x \in \mathbb{R}^n$. The model $\Phi$ assigns
to each class $j \in \{1, \ldots, m\}$ a (pre-softmax) score $\Phi_j(x)$ and the predicted label
is given by $j^* := \arg\max_{j \in \{1, \ldots, m\}} \Phi_j(x)$. One commonly used measure of the
distortion between the outputs at $x$ and another data point $y \in \mathbb{R}^n$ is given as

$$d_1(\Phi(x), \Phi(y)) := \big( \Phi_{j^*}(x) - \Phi_{j^*}(y) \big)^2.$$

On the other hand, the vector $[\Phi_j(x)]_{j=1}^{m}$ is usually normalized to a probability
vector $[\tilde{\Phi}_j(x)]_{j=1}^{m}$ by applying the softmax function, namely
$\tilde{\Phi}_j(x) := \exp \Phi_j(x) / \sum_{i=1}^{m} \exp \Phi_i(x)$. This, in turn, gives another measure of the
distortion between $\Phi(x), \Phi(y) \in \mathbb{R}^m$, namely

$$d_2(\Phi(x), \Phi(y)) := \big( \tilde{\Phi}_{j^*}(x) - \tilde{\Phi}_{j^*}(y) \big)^2,$$

where $j^* := \arg\max_{j \in \{1, \ldots, m\}} \Phi_j(x) = \arg\max_{j \in \{1, \ldots, m\}} \tilde{\Phi}_j(x)$. An important
property of the softmax function is the invariance under translation by a vector
$[c, \ldots, c]^\top \in \mathbb{R}^m$, where $c \in \mathbb{R}$ is a constant. By definition, only $d_2$ respects this
invariance while $d_1$ does not.
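The difference between the two classification distortions can be checked numerically. The sketch below uses hypothetical score vectors and implements $d_1$ and $d_2$ directly from the formulas above; the function names are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)                         # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def d1(out_x, out_y):
    """Squared difference of pre-softmax scores at the predicted label of x."""
    j = max(range(len(out_x)), key=lambda i: out_x[i])
    return (out_x[j] - out_y[j]) ** 2

def d2(out_x, out_y):
    """Squared difference of post-softmax probabilities at the predicted label of x."""
    j = max(range(len(out_x)), key=lambda i: out_x[i])
    return (softmax(out_x)[j] - softmax(out_y)[j]) ** 2

scores_x = [2.0, 0.5, -1.0]                 # hypothetical model outputs Phi(x)
scores_y = [1.0, 1.5, 0.0]                  # hypothetical model outputs Phi(y)
shifted_y = [z + 5.0 for z in scores_y]     # translation by [c, ..., c] with c = 5

d1_plain, d1_shifted = d1(scores_x, scores_y), d1(scores_x, shifted_y)
d2_plain, d2_shifted = d2(scores_x, scores_y), d2(scores_x, shifted_y)
```

Here `d1_plain` is 1.0 but `d1_shifted` is 16.0, while `d2_plain` and `d2_shifted` coincide, which illustrates that only $d_2$ inherits the translation invariance of the softmax.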


Example 8 (Measure of distortion for regression task). Consider a regression
model $\Phi : \mathbb{R}^n \to \mathbb{R}^m$ and an input signal $x \in \mathbb{R}^n$. One can then define the
measure of distortion between the outputs of $x$ and another data point $y \in \mathbb{R}^n$
as

$$d_3(\Phi(x), \Phi(y)) := \|\Phi(x) - \Phi(y)\|_2^2.$$

Sometimes it is reasonable to consider a certain subset of components
$J \subseteq \{1, \ldots, m\}$ of the output vectors instead of all $m$ entries. Denoting the vector
formed by the corresponding entries by $\Phi_J(x)$, the measure of distortion between
the outputs can be defined as

$$d_4(\Phi(x), \Phi(y)) := \|\Phi_J(x) - \Phi_J(y)\|_2^2.$$

The measure $d_4$ will be used in our experiments for radio maps in Subsect. 4.3.


3.2 Implementation 
The RDE mask from Definition 2 was defined as a solution to

$$\min_{s \in \{0,1\}^k} D(x, s, \mathcal{V}_s, \Phi) \quad \text{s.t.} \quad \|s\|_0 \le \ell.$$

In practice, we need to relax this problem. We offer the following three
approaches.
3.2.1 $\ell_1$-relaxation with Lagrange Multiplier

The RDE mask can be approximately computed by finding an approximate
solution to the following relaxed minimization problem:

$$\min_{s \in [0,1]^k} D(x, s, \mathcal{V}_s, \Phi) + \lambda \|s\|_1, \qquad \text{(P1)}$$

where $\lambda > 0$ is a hyperparameter for the sparsity level. Note that the optimization
problem is not necessarily convex, thus the solution might not be unique.

The expected distortion $D(x, s, \mathcal{V}_s, \Phi)$ can typically be approximated with
simple Monte-Carlo estimates, i.e., by averaging i.i.d. samples from $\mathcal{V}_s$. After
estimating $D(x, s, \mathcal{V}_s, \Phi)$, one can optimize the mask $s$ with stochastic gradient
descent (SGD) to solve the optimization problem (P1).
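A minimal sketch of this procedure is given below for a toy setting in which the gradient is available in closed form: a linear model $\Phi(x) = w^\top x$, squared-error distortion, and standard Gaussian perturbations. It uses plain projected stochastic gradient descent rather than Adam, and the function `rde_l1_relaxation` with its default parameters is a hypothetical illustration, not the configuration used in the experiments.

```python
import random

def rde_l1_relaxation(w, x, lam=0.5, steps=500, lr=0.01, n_samples=16, seed=0):
    """Projected SGD on D(x, s, V_s, Phi) + lam * ||s||_1 for Phi(x) = w . x."""
    rng = random.Random(seed)
    k = len(x)
    s = [1.0] * k                                   # initialise the mask with ones
    for _ in range(steps):
        grad = [lam] * k                            # gradient of lam * ||s||_1 for s >= 0
        for _ in range(n_samples):                  # Monte-Carlo estimate of grad D
            v = [rng.gauss(0.0, 1.0) for _ in range(k)]
            # Residual Phi(x) - Phi(y) with y = s * x + (1 - s) * v.
            r = sum(w[i] * (1.0 - s[i]) * (x[i] - v[i]) for i in range(k))
            for i in range(k):
                # d/ds_i of r^2 is 2 * r * (-w_i * (x_i - v_i)).
                grad[i] += 2.0 * r * (-w[i] * (x[i] - v[i])) / n_samples
        # Gradient step followed by projection onto [0, 1]^k.
        s = [min(1.0, max(0.0, s[i] - lr * grad[i])) for i in range(k)]
    return s

w = [3.0, 0.0, 0.0, 2.0]                            # the model ignores entries 1 and 2
x = [1.0, 1.0, 1.0, 1.0]
s_star = rde_l1_relaxation(w, x)
```

In this toy run the mask entries for the two ignored inputs are driven to zero by the sparsity penalty, while the entries for the inputs the model depends on stay close to one. For a neural network $\Phi$, the closed-form gradient would be replaced by automatic differentiation.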


3.2.2 Bernoulli Relaxation 

By viewing the binary mask as Bernoulli random variables $s \sim \mathrm{Ber}(\theta)$ and
optimizing over $\theta$, one can guarantee that the expected distortion $D(x, s, \mathcal{V}_s, \Phi)$
is evaluated on binary masks $s \in \{0,1\}^k$. To encourage sparsity of the resulting
mask, one can still apply $\ell_1$-regularization on $s$, giving rise to the following
optimization problem:

$$\min_{\theta \in [0,1]^k} \; \mathbb{E}_{s \sim \mathrm{Ber}(\theta)} \Big[ D(x, s, \mathcal{V}_s, \Phi) + \lambda \|s\|_1 \Big]. \qquad \text{(P2)}$$

Optimizing the parameter $\theta$ requires a continuous relaxation to apply SGD.
This can be done using the concrete distribution [17], which samples $s$ from a
continuous relaxation of the Bernoulli distribution.
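Sampling from the binary concrete distribution can be sketched as below. The sketch uses the usual logistic-noise reparameterization of the relaxed Bernoulli distribution; the function names and the clamping constant `eps` are illustrative choices.

```python
import math
import random

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a positive argument (avoids overflow)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sample_binary_concrete(theta, temperature, rng):
    """Relaxed Bernoulli(theta) sample in [0, 1] (binary concrete distribution)."""
    eps = 1e-12                                        # keeps the logarithms finite
    u = min(max(rng.random(), eps), 1.0 - eps)
    th = min(max(theta, eps), 1.0 - eps)
    logits = math.log(th) - math.log(1.0 - th) + math.log(u) - math.log(1.0 - u)
    return stable_sigmoid(logits / temperature)

rng = random.Random(0)
samples = [sample_binary_concrete(0.3, 0.1, rng) for _ in range(20000)]
frac_on = sum(s > 0.5 for s in samples) / len(samples)
```

A sample exceeds 0.5 with probability exactly $\theta$, so `frac_on` is close to 0.3, and at a low temperature such as 0.1 most samples lie near 0 or 1. Because the sample is a smooth function of $\theta$ for fixed noise $u$, gradients with respect to $\theta$ can be propagated through it.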


3.2.3 Matching Pursuit 

As an alternative, one can also perform matching pursuit [18]. Here, the non-zero
entries of $s \in \{0,1\}^k$ are determined sequentially in a greedy fashion to
minimize the resulting distortion in each step. More precisely, we start with a
zero mask $s^0 = 0$ and gradually build up the mask by updating $s^t$ at step $t$ by
the rule given by

$$s^{t+1} = s^t + \arg\min_{e_j :\, s^t_j = 0} D(x, s^t + e_j, \mathcal{V}_s, \Phi).$$

Here, the minimization is taken over all standard basis vectors $e_j \in \mathbb{R}^k$ with
$s^t_j = 0$. The algorithm terminates when reaching some desired error tolerance
or after a fixed number of iterations. While this means that in each iteration
we have to test every entry of $s$, it is applicable when $k$ is small or when we are
only interested in very sparse masks.
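The greedy update rule can be sketched as follows. The helper `matching_pursuit_mask` is a hypothetical illustration that takes the expected distortion as a callable; `toy_distortion` is an assumed closed form for a linear model with weights `w`, input $x = (1, \ldots, 1)$ and standard Gaussian perturbations.

```python
def matching_pursuit_mask(distortion, k, max_nonzeros, tol=0.0):
    """Greedily switch on the mask entry that lowers distortion(s) the most."""
    s = [0] * k                                    # start from the zero mask s^0
    for _ in range(max_nonzeros):
        if distortion(s) <= tol:                   # desired error tolerance reached
            break
        candidates = [j for j in range(k) if s[j] == 0]
        if not candidates:
            break

        def distortion_with(j):
            trial = list(s)
            trial[j] = 1                           # evaluates s^t + e_j
            return distortion(trial)

        best = min(candidates, key=distortion_with)
        s[best] = 1
    return s

# Assumed toy setting: Phi(x) = w . x, x = ones, v ~ N(0, I), squared-error distortion.
w = [3.0, 0.0, 0.0, 2.0]

def toy_distortion(s):
    a = [wi * (1 - si) for wi, si in zip(w, s)]
    return sum(a) ** 2 + sum(ai ** 2 for ai in a)

s_greedy = matching_pursuit_mask(toy_distortion, k=4, max_nonzeros=3)
```

The first step selects the entry with the largest weight, the second selects the remaining relevant entry, and the loop then stops because the distortion has reached the tolerance. Each step costs one distortion evaluation per remaining entry, which is the reason the method is practical only for small $k$ or very sparse masks.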
4 Experiments


With our experiments, we demonstrate the broad applicability of the general- 
ized RDE framework. Moreover, our experiments illustrate how different choices 
of obfuscation strategies, optimization procedures, measures of distortion, and 
input signal representations, discussed in Sect. 3.1, can be leveraged in practice. 
We explain model decisions on various challenging data modalities and tailor 
the input signal representation and measure of distortion to the domain and 
interpretation query. In Sect. 4.1, we focus on image classification, a common 
baseline task in the interpretability literature. In Sects. 4.2 and 4.3, we consider 
two other data modalities that are often unexplored. Section 4.2 focuses on audio 
data, where the underlying task is to classify acoustic instruments based on a 
short audio sample of distinct notes, while in Sect. 4.3, the underlying task is 
a regression with data in the form of physical simulations in urban environments.
We also believe our explanation framework supports applications beyond
interpretability tasks. An example is given in Sect. 4.3.2, where we add an
RDE-inspired regularizer to the training objective of a radio map estimation model.


4.1 Images 


We begin with the most ordinary domain in the interpretability literature: image 
classification tasks. The authors of [16] previously applied RDE to image data by
considering pixel-wise perturbations. We refer to this method as Pixel RDE.
Other explanation methods [13,20] have also operated exclusively
in the pixel domain. In [11], we challenged this customary practice by successfully
applying RDE in a wavelet basis, where sparsity translates into piece-wise
smooth images (also called cartoon-like images). The novel explanation method 
was coined CartoonX [11] and extracts the relevant piece-wise smooth part of 
an image. First, we review the Pixel RDE method and present experiments on 
the ImageNet dataset [4], which is commonly considered a challenging classifica- 
tion task. Finally, we present CartoonX and discuss its advantages. For all the 
ImageNet experiments, we use the pre-trained MobileNetV3-Small [10], which 
achieved a top-1 accuracy of 67.668% and a top-5 accuracy of 87.402%, as the 
classifier. 


4.1.1 Pixel RDE 

Consider the following pixel-wise representation of an RGB image $x \in \mathbb{R}^{3 \times n}$:
$f : \prod_{i=1}^{n} \mathbb{R}^3 \to \mathbb{R}^{3 \times n}$, $x = f(h_1, \ldots, h_n)$, where $h_i \in \mathbb{R}^3$ represents the three
color channel values of the $i$-th pixel in the image $x$, i.e., $(x_{i,j})_{j=1,\ldots,3} = h_i$.
In Pixel RDE a sparse mask $s \in [0,1]^n$ with $n$ entries (one for each pixel) is
optimized to achieve low expected distortion $D(x, s, \mathcal{V}_s, \Phi)$. The obfuscation of
an image $x$ with the pixel mask $s$ and a distribution $v \sim \mathcal{V}_s$ on $\prod_{i=1}^{n} \mathbb{R}^3$ is
defined as $f(s \odot h + (1-s) \odot v)$. In our experiments, we initialize the mask
with ones, i.e., $s_i = 1$ for every $i \in \{1, \ldots, n\}$, and consider Gaussian noise
perturbations $\mathcal{V}_s = \mathcal{N}(\mu, \Sigma)$. We set the noise mean $\mu \in \mathbb{R}^{3 \times n}$ as the pixel value
mean of the original image $x$ and the covariance matrix $\Sigma := \sigma^2 \mathrm{Id} \in \mathbb{R}^{3n \times 3n}$ as
a diagonal matrix with $\sigma > 0$ defined as the pixel value standard deviation of the
original image $x$. We then optimize the pixel mask $s$ for 2000 gradient descent
steps on the $\ell_1$-relaxation of the RDE objective (see Sect. 3.2.1). We computed
the distortion $d(\Phi(x), \Phi(y))$ in $D(x, s, \mathcal{V}_s, \Phi)$ in the post-softmax activation of
the predicted label multiplied by a constant $C = 100$, i.e.,

$$d(\Phi(x), \Phi(y)) := C \big( \tilde{\Phi}_{j^*}(x) - \tilde{\Phi}_{j^*}(y) \big)^2.$$

The expected distortion $D(x, s, \mathcal{V}_s, \Phi)$ was approximated as a simple Monte-Carlo
estimate after sampling 64 noise perturbations. For the sparsity level, we
set the Lagrange multiplier to $\lambda = 0.6$. All images were resized to 256 × 256
pixels. The mask was optimized for 2000 steps using the Adam optimizer with
step size 0.003. In the middle row of Fig. 1, we show three example explanations
with Pixel RDE for an image of a snail, a male duck, and an airplane, all from
the ImageNet dataset. Pixel RDE highlights as relevant both the snail’s inner
shell and part of its head, the lower segment of the male duck along with various
lines in the water, and the airplane’s fuselage and part of its rudder.


Fig. 1. Top row: original images correctly classified as (a) snail, (b) male duck, and
(c) airplane. Middle row: Pixel RDEs. Bottom row: CartoonX. Notably, CartoonX is
roughly piece-wise smooth and overall more interpretable than the jittery Pixel RDEs.




Fig. 2. Discrete Wavelet Transform of an image: (a) original image (b) discrete wavelet 
transform. The coefficients of the largest quadrant in (b) correspond to the lowest scale 
and coefficients of smaller quadrants gradually build up to the highest scales, which 
are located in the four smallest quadrants. Three nested L-shaped quadrants represent 
horizontal, vertical and diagonal edges at a resolution determined by the associated 
scale. 


4.1.2 CartoonX 

Formally, we represent an RGB image $x \in [0,1]^{3 \times n}$ in its wavelet coefficients
$h = \{h_i\}_{i=1}^{n} \in \prod_{i=1}^{n} \mathbb{R}^3$ with $J \in \{1, \ldots, \lfloor \log_2 n \rfloor\}$ scales as $x = f(h)$, where $f$ is
the discrete inverse wavelet transform. Each $h_i = (h_{i,c})_{c=1}^{3} \in \mathbb{R}^3$ contains three
wavelet coefficients of the image, one for each color channel, and is associated
with a scale $k_i \in \{1, \ldots, J\}$ and a position in the image. Low scales describe
high frequencies and high scales describe low frequencies at the respective image
position. We briefly illustrate the wavelet coefficients in Fig. 2, which visualizes
the discrete wavelet transform of an image. CartoonX [11] is a special case of
the generalized RDE framework, particularly a special case of Example 2, and
optimizes a sparse mask $s \in [0,1]^n$ on the wavelet coefficients (see Fig. 3c) so
that the expected distortion $D(x, s, \mathcal{V}_s, \Phi)$ remains small. The obfuscation of
an image $x$ with a wavelet mask $s$ and a distribution $v \sim \mathcal{V}_s$ on the wavelet
coefficients is $f(s \odot h + (1-s) \odot v)$. In our experiments, we used Gaussian
noise perturbations and chose the standard deviation and mean adaptively for
each scale: the standard deviation and mean for wavelet coefficients of scale
$j \in \{1, \ldots, J\}$ were chosen as the standard deviation and mean of the wavelet
coefficients of scale $j$ of the original image. Figure 3d shows the
obfuscation $f(s \odot h + (1-s) \odot v)$ with the final wavelet mask $s$ after the RDE
optimization procedure. In Pixel RDE, the mask itself is the explanation as it lies
in pixel space (see middle row in Fig. 1), whereas the CartoonX mask lies in the
wavelet domain. To go back to the natural image domain, we multiply the wavelet
mask element-wise with the wavelet coefficients of the original greyscale image
and invert this product back to pixel space with the discrete inverse wavelet
transform. The inversion is finally clipped into [0,1], as are obfuscations during
the RDE optimization, to avoid overflow (we assume here the pixel values in $x$ are
normalized into [0,1]). The clipped inversion in pixel space is the final CartoonX
explanation (see Fig. 3e).


Fig. 3. CartoonX machinery: (a) image classified as park-bench, (b) discrete wavelet
transform of the image, (c) final mask on the wavelet coefficients after the RDE
optimization procedure, (d) obfuscation with final wavelet mask and noise, (e) final
CartoonX, (f) Pixel RDE for comparison.
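The final visualization step (mask the wavelet coefficients, invert, clip) can be sketched in one dimension. This is a simplified illustration, not the CartoonX implementation: it swaps in an orthonormal 1D Haar transform for the 2D Daubechies wavelets used in the experiments, and the helper names are hypothetical.

```python
import math

def haar_forward(x):
    """Full orthonormal Haar transform of a signal whose length is a power of two."""
    coeffs = list(x)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        det = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = avg + det          # approximation first, then detail coefficients
        n = half
    return coeffs

def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    x = list(coeffs)
    n = 1
    while n < len(x):
        merged = []
        for a, d in zip(x[:n], x[n:2 * n]):
            merged += [(a + d) / math.sqrt(2), (a - d) / math.sqrt(2)]
        x[:2 * n] = merged
        n *= 2
    return x

def masked_reconstruction(x, mask):
    """Multiply the mask with the wavelet coefficients, invert, and clip into [0, 1]."""
    h = haar_forward(x)
    y = haar_inverse([s * c for s, c in zip(mask, h)])
    return [min(1.0, max(0.0, v)) for v in y]

# Piece-wise constant toy signal: only two Haar coefficients are non-zero.
x = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
mask = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
x_explained = masked_reconstruction(x, mask)
```

For this piece-wise constant signal a mask with only two non-zero entries already reproduces the input, which illustrates why sparsity in a wavelet basis favors piece-wise smooth explanations.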

The following points should be kept in mind when interpreting the final 
CartoonX explanation, i.e., the inversion of the wavelet coefficient mask: (1) 
CartoonX provides the relevant piece-wise smooth part of the image. (2) The
inversion of the wavelet coefficient mask was not optimized to be sparse in pixel 
space but in the wavelet basis. (3) A region that is black in the inversion could 
nevertheless be relevant if it was already black in the original image. This is due 
to the multiplication of the mask with the wavelet coefficients of the greyscale 
image before taking the discrete inverse wavelet transform. (4) Bright high res- 
olution regions are relevant in high resolution and bright low resolution regions 
are relevant in low resolution. (5) It is inexpensive for CartoonX to mark large 
regions in low resolution as relevant. (6) It is expensive for CartoonX to mark 
large regions in high resolution as relevant. 
Fig. 4. Scatter plot of rate-distortion in pixel basis and wavelet basis. Each point is an
explanation of a distinct image in the ImageNet dataset with distortion and normalized
$\ell_1$-norm measured for the final mask. The wavelet mask achieves lower distortion than
the pixel mask, while using fewer coefficients.


In Fig. 1, we compare CartoonX to Pixel RDE. The piece-wise smooth wavelet 
explanations are more interpretable than the jittery Pixel RDEs. In particular, 
CartoonX indicates that the snail’s shell without the head suffices for the classification,
unlike Pixel RDE, which suggested that both the inner shell and part
of the head are relevant. Moreover, CartoonX shows that the water gives the 
classifier context for the classification of the duck, which one could have only 
guessed from the Pixel RDE. Both Pixel RDE and CartoonX state that the head 
of the duck is not relevant. Lastly, CartoonX, like Pixel RDE, confirms that the 
wings play a subordinate role in the classification of the airplane. 


4.1.3 Why Explain in the Wavelet Basis?

Wavelets provide optimal representation for piece-wise smooth 1D functions [5], 
and represent 2D piece-wise smooth images, also called cartoon-like images [12], 
efficiently as well [21]. Indeed, sparse vectors in the wavelet coefficient space 
encode cartoon-like images reasonably well [27], certainly better than sparse 
pixel representations. Moreover, the optimization process underlying CartoonX 
produces sparse vectors in the wavelet coefficient space. Hence CartoonX typ- 
ically generates cartoon-like images as explanations. This is the fundamental 
difference to Pixel RDE, which produces rough, jittery, and pixel-sparse expla- 
nations. Cartoon-like images are more interpretable and provide a natural model 
of simplified images. Since the goal of the RDE explanation is to generate an 
easy to interpret simplified version of the input signal, we argue that CartoonX 
explanations are more appropriate for image classification than Pixel RDEs. 
Our experiments confirm that the CartoonX explanations are roughly piece- 
wise smooth explanations and are overall more interpretable than Pixel RDEs 
(see Fig. 1). 
4.1.4 CartoonX Implementation

Throughout our CartoonX experiments we chose the Daubechies 3 wavelet system,
$J = 5$ levels of scales and zero padding for the discrete wavelet transform.
For the implementation of the discrete wavelet transform, we used the PyTorch
Wavelets package, which supports gradient computation in PyTorch. Distortion
was computed as in the Pixel RDE experiments. The perturbations $v \sim \mathcal{V}_s$ on
the wavelet coefficients were chosen as Gaussian noise with standard deviation
and mean computed adaptively per scale. As in the Pixel RDE experiments, the
wavelet mask was optimized for 2000 steps with the Adam optimizer to minimize
the $\ell_1$-relaxation of the RDE objective. We used $\lambda = 3$ for CartoonX.


4.1.5 Efficiency of CartoonX 

Finally, we compare Pixel RDE to CartoonX quantitatively by analyzing the 
distortion and sparsity associated with the final explanation mask. Intuitively, 
we expect the CartoonX method to have an efficiency advantage, since the discrete
wavelet transform already encodes natural images sparsely, and hence fewer
wavelet coefficients are required to represent images than pixel coefficients. Our
experiments confirmed this intuition, as can be seen in the scatter plot in Fig. 4. 


4.2 Audio 


We consider the NSynth dataset [6], a library of short audio samples of distinct 
notes played on a variety of instruments. We pre-process the data by comput- 
ing the power-normalized magnitude spectrum and phase information using the 
discrete Fourier transform on a logarithmic scale from 20 to 8000 Hertz. Each 
data instance is then represented by the magnitude and the phase of its Fourier 
coefficients as well as the discrete inverse Fourier transform (see Example 3). 


4.2.1 Explaining the Classifier 

Our model $\Phi$ is a network trained to classify acoustic instruments. We compute
the distortion with respect to the pre-softmax scores, i.e., deploy $d_1$ in Example
7 as the measure of distortion. We follow the obfuscation strategy described in
Example 5 and train an inpainter G to generate the obfuscation G(h, s, z). Here, 
h corresponds to the representation of a signal, s is a binary mask and z is a 
normally distributed seed to the generator. 

We use a residual CNN architecture for G with added noise in the input and 
deep features. More details can be found in Sect. 4.2.3. We train G until the 
outputs are found to be satisfactory, exemplified by the outputs in Fig. 5. 

To compute the explanation maps, we numerically solve (P2) as discussed in
Subsect. 3.2. In particular, $s$ is a binary mask indicating whether the phase and
magnitude information of a certain frequency should be dropped and is specified
as a Bernoulli variable $s \sim \mathrm{Ber}(\theta)$. We chose a regularization parameter of $\lambda = 50$
and minimized the corresponding objective using the Adam optimizer with a step
size of $10^{-5}$ in 10° iterations. For the concrete distribution, we used a temperature
of 0.1. Two examples resulting from this process can be seen in Fig. 6.
Fig. 5. Inpainted Bass: Example inpainting from G. The bottom plot depicts phase
versus frequency and the top plot depicts magnitude versus frequency. The random
binary mask is represented by the green parts. The axes for the inpainted signal (black)
and the original signal (blue dashed) are offset to improve visibility. Note how the
inpainter generates plausible peaks in the magnitude and phase spectra, especially
with regard to rapid (> 600 Hz) versus smooth (< 270 Hz) changes in phase. (Color
figure online)


Notice that the method shows a strong reliance of the classifier
on low frequencies (30 Hz-60 Hz) to classify the top sample in Fig. 6 as a guitar,
as only the guitar samples have this low-frequency slope in the spectrum. In
contrast, classifying the bass sample relies more on the continuous
signal between 100 Hz and 230 Hz.


4.2.2 Magnitude vs Phase 

In the above experiment, we have represented the signals by the magnitude and 
phase information at each frequency, hence the mask s acts on each frequency. 
Now we consider the interpretation query of whether the entire magnitude spec- 
trum or the entire phase spectrum is more relevant for the prediction. Accord- 
ingly, we consider the representation discussed in Example 4 and apply the mask 
s to turn off or on the whole magnitude spectrum or the phase information. 
Furthermore, we can optimize s not only for one datum but for all samples 
from a class. This extracts the information whether magnitude or phase is more 
important for predicting samples from a specific class. 

For this, we again minimized (P2) (averaged over all samples of a class) with $\theta$
as the Bernoulli parameter using the Adam optimizer for $2 \times 10^{6}$ iterations with a
step size of $10^{-4}$ and the regularization parameter $\lambda = 30$. Again, a temperature
of t = 0.1 was used for the concrete distribution.

From the results of these computations, which can be seen in Table 2, we can
observe that there is a clear difference in what the classifier bases its decision
on across instruments. The classification of most instruments is largely based on
phase information. For the mallet, the values are low for magnitude and phase,
which means that the expected distortion is very low compared to the $\ell_1$-norm of
the mask, even when the signal is completely inpainted. This underlines that the
regularization parameter $\lambda$ may have to be adjusted for different data instances,
especially when measuring distortion in the pre-softmax scores.

Fig. 6. Interpreting NSynth Model: The optimized importance parameter $\theta$ (green)
overlaid on top of the DFT (blue). For each of guitar and bass, the top graph shows
the power-normalized magnitude and the bottom the phase. Notice the solid peaks
between 30 Hz and 60 Hz for guitar and between 100 Hz and 230 Hz for bass. These
occur because the model is relying on those parts of the spectra for the classification.
Notice also how many parts of the spectrum are important even when the magnitude
is near zero. This indicates that the model pays attention to whether those frequencies
are missing. (Color figure online)


4.2.3 Architecture of the Inpainting Network G 

Here, we briefly describe the architecture of the inpainting network G that was 
used to generate obfuscations to the target signals. In particular, Fig. 7 shows 
the diagram of the network G and Table 3 shows information about its layers. 


Table 2. Magnitude importance versus phase importance.

Instrument   Magnitude importance   Phase importance
Organ        0.829                  1.0
Guitar       0.0                    0.999
Flute        0.092                  1.0
Bass         1.0                    1.0
Reed         0.136                  1.0
Vocal        1.0                    1.0
Mallet       0.005                  0.217
Brass        0.999                  1.0
Keyboard     0.003                  1.0
String       1.0                    0.0


4.3 Radio Maps


In this subsection, we assume a set of transmitting devices (Tx) broadcasting
a signal within a city. The received strength varies with location and depends
on physical factors such as line of sight, reflection, and diffraction. We consider
the regression problem of estimating a function that assigns the proper
signal strength to each location in the city. Our dataset D is RadioMapSeer [14],
containing 700 maps, 80 Tx per map, and a corresponding grayscale label
encoding the signal strength at every location. Our model $\Phi$ receives as input
$x = [x^{(0)}, x^{(1)}, x^{(2)}]$, where $x^{(0)}$ is a binary map of the Tx locations, $x^{(1)}$ is a
noisy binary map of the city (where a few buildings are missing), and $x^{(2)}$ is
a grayscale image representing a number of ground truth measurements of the
strength of the signal at the measured locations and zero elsewhere. We apply
the UNet [13,14,22] architecture and train $\Phi$ to output the estimation of the
signal strength throughout the city that interpolates the input measurements.

Apart from the model $\Phi$, we also have a simpler model $\Phi_0$, which only receives
the city map and the Tx locations as inputs and is trained with unperturbed
input city maps. This second model $\Phi_0$ will be deployed to inpaint measurements
to input to $\Phi$. See Fig. 8a, 8b, and 8c for examples of a ground truth map and
estimations for $\Phi$ and $\Phi_0$, respectively.


[Figure: diagram of the inpainting network. Labeled elements: magnitude and phase spectrum, binary mask, Gaussian noise (two inputs), skip connections.]

Fig. 7. Diagram of the inpainting network for NSynth.

Table 3. Layer table of the inpainting model for the NSynth task.

Layer                Filter size   Output shape       # Params
Conv1d-1             21            [-1, 32, 1024]     4,736
ReLU-2               -             [-1, 32, 1024]     0
Conv1d-3             21            [-1, 64, 502]      43,072
ReLU-4               -             [-1, 64, 502]      0
BatchNorm1d-5        -             [-1, 64, 502]      128
Conv1d-6             21            [-1, 128, 241]     172,160
ReLU-7               -             [-1, 128, 241]     0
BatchNorm1d-8        -             [-1, 128, 241]     256
Conv1d-9             21            [-1, 16, 112]      43,024
ReLU-10              -             [-1, 16, 112]      0
BatchNorm1d-11       -             [-1, 16, 112]      32
ConvTranspose1d-12   21            [-1, 64, 243]      43,072
ReLU-13              -             [-1, 64, 243]      0
BatchNorm1d-14       -             [-1, 64, 243]      128
ConvTranspose1d-15   21            [-1, 128, 505]     172,160
ReLU-16              -             [-1, 128, 505]     0
BatchNorm1d-17       -             [-1, 128, 505]     256
ConvTranspose1d-18   20            [-1, 64, 1024]     163,904
ReLU-19              -             [-1, 64, 1024]     0
BatchNorm1d-20       -             [-1, 64, 1024]     128
Skip Connection      -             [-1, 103, 1024]    0
Conv1d-21            7             [-1, 128, 1024]    92,416
ReLU-22              -             [-1, 128, 1024]    0
Conv1d-23            7             [-1, 2, 1024]      1,794
ReLU-24              -             [-1, 2, 1024]      0
Total number of parameters                            737,266
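
The parameter counts in Table 3 can be cross-checked with the usual formula for one-dimensional (transposed) convolutions, namely input channels times output channels times filter size, plus one bias per output channel. The sketch below does this under one loud assumption: the input channel counts are not listed in the table and are inferred here so that the formula reproduces each row. In particular, ConvTranspose1d-12 is assumed to receive 32 channels rather than 16, which would be consistent with noise channels being concatenated to the deep features, and Conv1d-21 is assumed to receive the 103 channels of the skip connection.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel (same count for Conv1d and ConvTranspose1d)."""
    return in_channels * out_channels * kernel_size + out_channels


def batchnorm_params(channels):
    """One learnable scale and one learnable shift per channel."""
    return 2 * channels


# (layer, inferred input channels, output channels, filter size).
# The input channel counts are NOT given in Table 3; they are inferred (assumption).
CONV_LAYERS = [
    ("Conv1d-1", 7, 32, 21),
    ("Conv1d-3", 32, 64, 21),
    ("Conv1d-6", 64, 128, 21),
    ("Conv1d-9", 128, 16, 21),
    ("ConvTranspose1d-12", 32, 64, 21),
    ("ConvTranspose1d-15", 64, 128, 21),
    ("ConvTranspose1d-18", 128, 64, 20),
    ("Conv1d-21", 103, 128, 7),
    ("Conv1d-23", 128, 2, 7),
]
# Channel counts of the six BatchNorm1d layers, in table order.
BATCHNORM_CHANNELS = [64, 128, 16, 64, 128, 64]

conv_total = sum(conv_params(c_in, c_out, k) for _, c_in, c_out, k in CONV_LAYERS)
bn_total = sum(batchnorm_params(c) for c in BATCHNORM_CHANNELS)
total_params = conv_total + bn_total
```

Under these inferred channel counts, `total_params` agrees with the total reported in the table.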


4.3.1 Explaining Radio Map $\Phi$

Observe that in Fig. 8a there is a missing building in the input (the black one)
and in Fig. 8b, $\Phi$ in-fills this building with a shadow. Since $\Phi$ is a black box, it
is unclear why it made this decision. Did it rely on signal measurements or on
building patterns? To address this, we consider each building as a cluster of pixels
and each measurement as potential targets for our mask $s = [s^{(1)}, s^{(2)}]$, where
$s^{(1)}$ acts on buildings and $s^{(2)}$ acts on measurements. We then apply matching
pursuit (see Subsect. 3.2.3) to find a minimal mask s of critical components
(buildings and measurements).

[Figure panels: (a) Ground Truth, (b) $\Phi$ Estimation, (c) $\Phi_0$ Estimation.]

Fig. 8. Radio map estimations: The radio map (gray), input buildings (blue), and input
measurements (red). (Color figure online)

To be precise, suppose we are given a target input signal $x = [x^{(0)}, x^{(1)}, x^{(2)}]$.
Let $k_1$ denote the number of buildings in $x^{(1)}$ and $k_2$ denote the number of
measurements in $x^{(2)}$. Consider the function $f_1$ that takes as inputs vectors in
$\{0, 1\}^{k_1}$, which indicate the existence of buildings in $x^{(1)}$, and maps them to the
corresponding city map in the original city map format. Analogously, consider
the function $f_2$ that takes as input the measurements in $\mathbb{R}^{k_2}$ and maps them to
the corresponding grayscale image of the original measurements format. Then,
$f_1$ and $f_2$ encode the locations of the buildings and measurements in the target
signal $x = [x^{(0)}, f_1(h^{(1)}), f_2(h^{(2)})]$, where $h^{(1)}$ and $h^{(2)}$ denote the building and
measurement representations of x in $f_1$ and $f_2$. When $s^{(1)}$ has a zero entry, i.e.,
a building in $h^{(1)}$ was not selected, we replace the value in the obfuscation
with zero (this corresponds to a constant perturbation equal to zero). Then, the
obfuscation of the target signal x with a mask $s = [s^{(1)}, s^{(2)}]$ and perturbations
$v = [v^{(1)}, v^{(2)}] = [0, v^{(2)}]$ becomes:

$$y = \big(x^{(0)},\; f_1(s^{(1)} \odot h^{(1)}),\; f_2(s^{(2)} \odot h^{(2)} + (1 - s^{(2)}) \odot v^{(2)})\big).$$
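
As a small illustration of the obfuscation above, the sketch below applies the two masks at the level of the building and measurement representations. The rendering functions that map these vectors back to images are deliberately left out, and the function name `obfuscate` is an illustrative assumption.

```python
def obfuscate(s1, h1, s2, h2, v2):
    """Return the masked building vector and the completed measurement vector.

    s1, h1: binary mask and building indicators (one entry per building).
    s2, h2: binary mask and measurement values (one entry per measurement).
    v2:     perturbation used where s2 is zero.
    """
    # Buildings: unselected entries are set to zero (constant perturbation equal to zero).
    buildings = [s * h for s, h in zip(s1, h1)]
    # Measurements: keep selected entries, fill unselected entries with the perturbation.
    measurements = [s * h + (1 - s) * v for s, h, v in zip(s2, h2, v2)]
    return buildings, measurements
```

For example, with `s2 = [1, 0, 0]`, the first measurement is kept and the other two are replaced by the corresponding entries of `v2`.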


While it is natural to model masking out a building by simply zeroing out
the corresponding cluster of pixels by choosing $v^{(1)} = 0$, we also need to properly
choose $v^{(2)}$ for the entries where the mask $s^{(2)}$ takes value 0, in order to
obtain appropriate obfuscations. For this, we can deploy the second model $\Phi_0$
as an inpainter. We consider the following two extreme obfuscation strategies.
The first is to also set $v^{(2)}$ to zero, i.e., simply remove the unchosen measurements
from the input, with the underlying assumption being that any subset of
measurements is valid for a city map. In the other extreme case, we inpaint all
unchosen measurements by sampling at their locations the estimated radio map
obtained by $\Phi_0$ based on the buildings selected by $s^{(1)}$.

The two extreme measurement completion methods correspond to two
extremes of the interpretation query. Filling in the missing measurements by $\Phi_0$
tends to overestimate the strength of the signal because there are fewer buildings
to obstruct the transmissions. The empty mask will complete all measurements
to the maximal possible signal strength, i.e., the free space radio map. The
overestimation in signal strength is reduced when more measurements and buildings are
chosen, resulting in darker estimated radio maps. Thus, this strategy is related
to the query of which measurements and buildings are important to darken the
free space radio map, turning it into the radio map produced by $\Phi$. In the other
extreme, where unchosen measurements are set to zero, adding more measurements
to the mask with a fixed set of buildings typically brightens the resulting radio map.
This allows us to answer which measurements are most important for brightening
the radio map.

Between these two extreme strategies lies a continuum of completion methods
where a random subset of the unchosen measurements is sampled from $\Phi_0$,
while the rest are set to zero. Examples of explanations of a prediction $\Phi(x)$
according to these methods are presented in Fig. 9. Since we only care about
specific small patches exemplified by the green boxes, the distortion here is
measured with respect to the $\ell^2$ distance between the output images restricted to
the corresponding region (see also Example 8).
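
The continuum of completion methods and the patch-restricted distortion can be sketched as follows. This is only an illustration under stated assumptions: measurements are handled as flat lists, `phi0_estimate` stands for the values of the simpler model's radio map at the measurement locations, `inpaint_fraction` is a hypothetical parameter selecting a point on the continuum, and the patch is given as a hypothetical box of row and column ranges.

```python
import math
import random


def complete_measurements(s2, h2, phi0_estimate, inpaint_fraction, rng=random):
    """Keep chosen measurements, inpaint a random fraction of the rest, zero the remainder."""
    unchosen = [i for i, s in enumerate(s2) if s == 0]
    n_inpaint = round(inpaint_fraction * len(unchosen))
    inpainted = set(rng.sample(unchosen, n_inpaint))
    completed = []
    for i, s in enumerate(s2):
        if s == 1:
            completed.append(h2[i])             # chosen: keep the true measurement
        elif i in inpainted:
            completed.append(phi0_estimate[i])  # unchosen: sampled from the inpainter
        else:
            completed.append(0.0)               # unchosen: removed from the input
    return completed


def patch_l2_distance(image_a, image_b, box):
    """l2 distance between two images restricted to box = (row0, row1, col0, col1)."""
    row0, row1, col0, col1 = box
    return math.sqrt(sum((image_a[r][c] - image_b[r][c]) ** 2
                         for r in range(row0, row1) for c in range(col0, col1)))
```

Setting `inpaint_fraction` to 0 or 1 recovers the two extreme strategies described above.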


[Figure panels: (a) Estimated map. (b) Explanation: inpaint all unchosen measurements. (c) Explanation: inpaint 2.5% of unchosen measurements.]

Fig. 9. Radio map queries and explanations: The radio map (gray), input buildings
(blue), input measurements (red), and area of interest (green box). Middle represents
the query “How to fill in the image with shadows”, while right is the query “How to fill
in the image both with shadows and bright spots?”. We inpaint with $\Phi_0$. (Color figure
online)


When the query is how to darken the free space radio map (Fig. 9b), the
optimized mask s suggests that samples in the shadow of the missing building are
the most influential in the prediction. These dark measurements are supposed to
be in line-of-sight of a Tx, which indicates that the network deduced that there is
a missing building. When the query is how to fill in the image both with shadows
and bright spots (Fig. 9c), both samples in the shadow of the missing building
and samples right before the building are influential. This indicates that the
network used the bright measurements in line-of-sight and avoided predicting an
overly large building. To understand the chosen buildings, note that $\Phi$ is based
on a composition of UNets and is thus interpreted as a procedure of extracting
high-level and global information from the inputs to synthesize the output. The
locations of the chosen buildings in Fig. 9 reflect this global nature.

4.3.2 Interpretation-Driven Training

We now discuss an example application of the explanation obtained by the RDE
approach described above, called interpretation driven training [23, 24, 28]. When
a missing building is in line-of-sight of a Tx, we would like $\Phi$ to reconstruct this
building relying on samples in the shadow of the building rather than patterns in
the city. To reduce the reliance of $\Phi$ on the city information in this situation, one
can add a regularization term in the training loss which promotes explanations
relying on measurements. Suppose $x = [x^{(0)}, x^{(1)}, x^{(2)}]$ contains a missing input
building in line-of-sight of the Tx location and denote the subset of pixels of the
missing building in the city map as $J_x$. Denote the prediction by $\Phi$ restricted to
the subset $J_x$ as $\Phi_{J_x}$. Moreover, define $\tilde{x} := [x^{(0)}, 0, x^{(2)}]$ to be the modification
of x with all input buildings masked out. We then define the interpretation loss
for x as

$$\ell_{\mathrm{int}}(\Phi, x) = \big\| \Phi_{J_x}(x) - \Phi_{J_x}(\tilde{x}) \big\|.$$


[Figure panels: (a) Vanilla estimation. (b) Interpretation-driven $\Phi_{\mathrm{int}}$ estimation. (c) Vanilla explanation. (d) Interpretation-driven $\Phi_{\mathrm{int}}$ explanation.]

Fig. 10. Radio map estimations, interpretation driven training vs vanilla training: The
radio map (gray), input buildings (blue), input measurements (red), and domain of the
missing building (green box). (Color figure online)


The interpretation driven training objective then regularizes $\Phi$ during training
by adding the interpretation loss for all inputs x that contain a missing input
building in line-of-sight of the Tx location. An example comparison between
explanations of the vanilla RadioUNet $\Phi$ and the interpretation driven network
$\Phi_{\mathrm{int}}$ is given in Fig. 10.
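
The regularized objective can be sketched in a few lines. Everything specific here is an assumption made for illustration: the model is any callable taking the three input channels as nested lists, the norm in the interpretation loss is taken to be the squared $\ell^2$ norm, and `weight` is a hypothetical coefficient balancing the two terms.

```python
def interpretation_loss(model, x0, x1, x2, missing_pixels):
    """Discrepancy, on the missing-building pixels, between predictions with and without the city map."""
    empty_city = [[0.0 for _ in row] for row in x1]   # all input buildings masked out
    prediction = model(x0, x1, x2)
    prediction_no_city = model(x0, empty_city, x2)
    # Squared l2 norm restricted to the pixel subset (assumed choice of norm).
    return sum((prediction[r][c] - prediction_no_city[r][c]) ** 2
               for r, c in missing_pixels)


def training_loss(task_loss, interp_loss, weight=1.0):
    """Task loss plus the weighted interpretation loss (weight is hypothetical)."""
    return task_loss + weight * interp_loss
```

A model whose prediction inside the missing-building region does not depend on the city map incurs zero interpretation loss, which is the behavior the regularizer is meant to promote.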


5 Conclusion 


In this chapter, we presented the Rate-Distortion Explanation (RDE) frame- 
work in a revised and comprehensive manner. Our framework is flexible enough 
to answer various interpretation queries by considering suitable data represen- 
tations tailored to the underlying domain and query. We demonstrate the latter 
and the overall efficacy of the RDE framework on an image classification task, 
on an audio signal classification task, and on a radio map estimation task, a
rarely explored regression task.


Abstract. Unsupervised learning is a subfield of machine learning that
focuses on learning the structure of data without making use of labels.
This implies a different set of learning algorithms than those used for
supervised learning, and consequently, also prevents a direct transposition
of Explainable AI (XAI) methods from the supervised to the less
studied unsupervised setting. In this chapter, we review our recently proposed
‘neuralization-propagation’ (NEON) approach for bringing XAI
to workhorses of unsupervised learning such as kernel density estimation
and k-means clustering. NEON first converts (without retraining)
the unsupervised model into a functionally equivalent neural network so
that, in a second step, supervised XAI techniques such as layer-wise relevance
propagation (LRP) can be used. The approach is showcased on
two application examples: (1) analysis of spending behavior in wholesale
customer data and (2) analysis of visual features in industrial and scene
images.


Keywords: Explainable AI · Unsupervised learning · Neural networks


1 Introduction 


Supervised learning has been in the spotlight of machine learning research and 
applications for the last decade, with deep neural networks achieving record- 
breaking classification accuracy and enabling new machine learning applications 
[5, 15, 23]. The success of deep neural networks can be attributed to their ability
to implement, with their multiple layers, complex nonlinear functions in a compact
manner [32]. Recently, a significant amount of work has been dedicated to
making deep neural network models more transparent [13,24,40,41], for example,
by proposing algorithms that identify which input features are responsible for a
given classification outcome. Methods such as layer-wise relevance propagation 
(LRP) [3], guided backprop [47], and Grad-CAM [42], have been shown capable 
of quickly and robustly computing these explanations. 

Unsupervised learning is substantially different from supervised learning in 
that there is no ground-truth supervised signal to match. Consequently, non- 
neural network models such as kernel density estimation or k-means clustering, 
where the user controls the scale and the level of abstraction through a particular 
choice of kernel or feature representation, have remained highly popular. Despite 
the predominance of unsupervised machine learning in a variety of applications 
(e.g. [9,22]), research on explaining unsupervised models has remained relatively 
sparse [18,19,25,28,30] compared to their supervised counterparts. Paradoxi- 
cally, it might in fact be unsupervised models that most strongly require inter- 
pretability. Unsupervised models are indeed notoriously hard to quantitatively 
validate [51], and the main purpose of applying these models is often to better 
understand the data in the first place [9,17]. 

In this chapter, we review the ‘neuralization-propagation’ (NEON) approach 
we have developed in the papers [18-20] to make the predictions of unsupervised 
models, e.g. cluster membership or anomaly score, explainable. NEON proceeds 
in two steps: (1) the decision function of the unsupervised model is reformulated 
(without retraining) as a functionally equivalent neural network (i.e. it is ‘neu- 
ralized’); (2) the extracted neural network structure is then leveraged by the 
LRP method to produce an explanation of the model prediction. We review the 
application of NEON to kernel density estimation for outlier detection and k- 
means clustering, as presented originally in [18-20]. We also extend the reviewed 
work with a new contribution: explanation of inlier detection, and we use the 
framework of random features [36] for that purpose. 

The NEON approach is showcased on several practical examples, in particu- 
lar, the analysis of wholesale customer data, image-based industrial inspection, 
and analysis of scene images. The first scenario covers the application of the 
method directly to the raw input features, whereas the second scenario illus- 
trates how the framework can be applied to unsupervised models built on some 
intermediate layer of representation of a neural network. 


2 A Brief Review of Explainable AI 


The field of Explainable AI (XAI) has produced a wealth of explanation tech- 
niques and types of explanation. They address the heterogeneity of ML models 
found in applications and the heterogeneity of questions the user may formulate 
about the model and its predictions. An explanation may take the form of a sim- 
ple decision tree (or other intrinsically interpretable model) that approximates 
the model’s input-output relation [10,29]. Alternatively, an explanation may be 
a prototype for the concept represented at the output of the model, specifically, 
an input example to which the model reacts most strongly [34,45]. Lastly, an 
explanation may highlight what input features are the most important for the 
model’s predictions [3,4,7]. 

In the following, we focus on a well-studied problem of XAI, which is how
to attribute the prediction of an individual data point to the input features
[3, 4, 29, 37, 45, 48, 50]. Let us denote by $\mathcal{X} = \mathcal{I}_1 \times \dots \times \mathcal{I}_d$ the input space formed
by the concatenation of d input features (e.g. words, pixels, or sensor measurements).
We assume a learned model $f : \mathcal{X} \to \mathbb{R}$ (supervised or unsupervised),
mapping each data point in $\mathcal{X}$ to a real-valued score measuring the evidence
for a class or some other predicted quantity. The problem of attribution can be
abstracted as producing for the given function f a mapping $\mathcal{E}_f : \mathcal{X} \to \mathbb{R}^d$ that
associates to each input example a vector of scores representing the (positive or
negative) contribution of each feature. Often, one requires attribution techniques
to implement a conservation (or completeness) property, where for all $x \in \mathcal{X}$
we have $1^\top \mathcal{E}_f(x) = f(x)$, i.e. for every data point the sum of explanation scores
over the input features should match the function value.


2.1 Approaches to Attribution 


A first approach, occlusion-based, consists of testing the function to explain 
against various occlusions of the input features [53,54]. An important method of 
this family (and which was originally developed in the context of game theory) 
is the Shapley value [29,43,48]. The Shapley value identifies a unique attribu- 
tion that satisfies some predefined set of axioms of an explanation, including 
the conservation property stated above. While the approach has strong theoret- 
ical underpinnings, computing the explanation however requires an exponential 
number of function evaluations (an evaluation for every subset of input features). 
This makes the Shapley value in its basic form intractable for any problem with 
more than a few input dimensions. 
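
The exponential cost can be seen directly in a brute-force implementation. The sketch below is illustrative only: it assumes that an occluded feature is replaced by the corresponding entry of a baseline vector (one common convention, not prescribed by the text), and the function name `shapley_values` is hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, enumerating every subset of the other features."""
    d = len(x)

    def value(subset):
        # Features outside the subset are occluded, i.e. set to the baseline (assumption).
        return f([x[i] if i in subset else baseline[i] for i in range(d)])

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                subset = set(subset)
                # Marginal contribution of feature i when added to this subset.
                phi[i] += weight * (value(subset | {i}) - value(subset))
    return phi
```

With this convention the scores sum to the function value at x minus the function value at the baseline, so conservation in the sense stated above holds when the function is zero at the baseline. The number of subsets visited doubles with every additional feature.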

Another approach, gradient-based, leverages the gradient of the function, so 
that a mapping of the function value onto the multiple input dimensions is 
readily obtained [45,50]. The method of integrated gradients [50], in particular, 
attributes the prediction to input features by integrating the gradient along a 
path connecting some reference point (e.g. the origin) to the data point. The 
method requires somewhere between ten and a hundred function evaluations, 
and satisfies the aforementioned conservation property. The main advantage of 
gradient-based methods is that, by leveraging the gradient information in addi- 
tion to the function value, one no longer has to perturb each input feature 
individually to produce an explanation. 
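
A minimal sketch of integrated gradients is given below, assuming that the gradient of the function is available as a callable `grad_f` (a hypothetical name) and that the path integral is approximated with a midpoint Riemann sum over a straight line from the reference point to the data point.

```python
def integrated_gradients(grad_f, x, reference, steps=50):
    """Approximate integrated gradients along the straight path from reference to x."""
    d = len(x)
    attributions = [0.0] * d
    for m in range(steps):
        t = (m + 0.5) / steps  # midpoint of the m-th sub-interval of [0, 1]
        point = [r + t * (xi - r) for xi, r in zip(x, reference)]
        gradient = grad_f(point)
        for i in range(d):
            # Gradient times path direction, averaged over the integration steps.
            attributions[i] += gradient[i] * (x[i] - reference[i]) / steps
    return attributions
```

The number of gradient evaluations equals `steps`, independently of the number of input features, and the attributions sum (up to discretization error) to the difference between the function value at x and at the reference point.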

A further approach, surrogate-based, consists of learning a simple local surro- 
gate model of the function which is as accurate as possible, and whose structure 
makes explanation fast and unambiguous [29,37]. For example, when approximating
the function locally with a linear model, e.g. $g(x) = \sum_i x_i w_i$, the
output of that linear model can be easily decomposed onto the input features by
taking the individual summands. While explanation itself is fast to compute,
training the surrogate model incurs a significant additional cost, and further 
care must be taken to ensure that the surrogate model implements the same 
decision strategy as the original model, in particular, that it uses the same input 
features. 


120 G. Montavon et al. 


A last approach, propagation-based, assumes that the prediction has been
produced by a neural network, and leverages the neural network structure by
casting the problem of explanation as performing a backward pass in the network
[3,42,47]. The propagation approach is embodied by the Layer-wise Relevance
Propagation (LRP) method [3,31]. The backward pass implemented by
LRP consists of a sequence of conservative propagation steps where each step
is implemented by a propagation rule. Let j and k be indices for neurons at
layer $l$ and $l + 1$ respectively, and assume that the function output $f(x)$ has
been propagated from the top layer to layer $l + 1$. We denote the resulting
attribution onto these neurons as the vector of ‘relevance scores’ $(R_k)_k$. LRP then
defines ‘messages’ $R_{j \leftarrow k}$ that redistribute the relevance $R_k$ to neurons in the layer
below. These messages typically have the structure $R_{j \leftarrow k} = [z_{jk} / \sum_j z_{jk}] \cdot R_k$,
where $z_{jk}$ models the contribution of neuron j to activating neuron k. The
overall relevance of neuron j is then obtained by computing $R_j = \sum_k R_{j \leftarrow k}$.
It is easy to show that application of LRP from one layer to the layer below
is conservative. Consequently, the explanation formed by iterating the LRP
propagation from the top layer to the input layer is also conservative,
i.e. $\sum_i R_i = \dots = \sum_j R_j = \sum_k R_k = \dots = f(x)$. As a result, explanations
satisfying the conservation property can be obtained within a single
forward/backward pass, instead of multiple function evaluations, as was the case
for the approaches described above. The runtime advantage of LRP facilitates
explanation of large models and datasets (e.g. GPU implementations of LRP
can achieve hundreds of image classification explanations per second [1,40]).
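
To illustrate the propagation rule and its conservation property, here is a small sketch on a bias-free network with one ReLU hidden layer and a linear output. It assumes the basic choice of contribution, activation times weight, which is only one of several LRP rules; the function names are illustrative.

```python
def forward(x, w1, w2):
    """Bias-free network: ReLU hidden layer followed by a single linear output."""
    hidden = [max(0.0, sum(xi * row[k] for xi, row in zip(x, w1)))
              for k in range(len(w1[0]))]
    output = sum(a * row[0] for a, row in zip(hidden, w2))
    return hidden, output


def lrp_step(activations, weights, relevance_upper):
    """Redistribute relevance from layer l+1 to layer l; weights[j][k] links neuron j to k."""
    relevance_lower = [0.0] * len(activations)
    for k, r_k in enumerate(relevance_upper):
        z = [a * row[k] for a, row in zip(activations, weights)]  # contributions z_jk
        total = sum(z)
        if total == 0.0:
            continue  # nothing flows through this neuron, so nothing to redistribute
        for j, z_jk in enumerate(z):
            relevance_lower[j] += z_jk / total * r_k  # message from neuron k to neuron j
    return relevance_lower


def explain(x, w1, w2):
    """One forward pass followed by one backward relevance pass."""
    hidden, output = forward(x, w1, w2)
    relevance_hidden = lrp_step(hidden, w2, [output])
    relevance_input = lrp_step(x, w1, relevance_hidden)
    return output, relevance_input
```

In this bias-free setting the input relevances sum to the network output, and the whole explanation costs one forward and one backward pass.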


2.2 Neuralization-Propagation 


Propagation-based explanation techniques such as LRP have a computational 
advantage over approaches based on multiple function evaluations. However, 
they assume a preexisting neural network structure associated to the prediction 
function. Unsupervised learning models such as kernel density estimation or k- 
means, are a priori not neural networks. However, the fact that these models are 
not given as neural networks does not preclude the existence of a neural network 
that implements the same function. If such a network exists (neural network 
equivalents of some unsupervised models will be presented in Sects. 3 and 4), we 
can quickly and robustly compute explanations by applying the following two 
steps: 


Step 1: The unsupervised model is ‘neuralized’, that is, rewritten (without 
retraining) as a functionally equivalent neural network. 

Step 2: The LRP method is applied to the resulting neural network, in order 
to produce an explanation of the prediction of the original model. 


These two steps are illustrated in Fig. 1. In practice, for the second step to
work well, some restrictions must be imposed on the type of neurons composing
the network. In particular, neurons should have a clear directionality in their
input space to ensure that meaningful propagation to the lower layer can be
achieved. (We will see in Sects. 3 and 4 that this requirement does not always
hold.) Hence, the ‘neuralized model’ must be designed under the double constraint
of (1) replicating the decision function of the unsupervised model exactly,
and (2) being composed of neurons that enable a meaningful redistribution from
the output to the input features.


Explaining Unsupervised Learning Models 121


[Figure: unsupervised model, then (neuralization) neural network equivalent, then (propagation) explanation $\mathcal{E}_f(x)$ showing the contribution of each input feature.]

Fig. 1. Overview of the neuralization-propagation (NEON) approach to explain the
predictions of an unsupervised model. As a first step, the unsupervised model is
transformed without retraining into a functionally equivalent neural network. As a second
step, the LRP procedure is applied to identify, with help of the neural network
structure, by what amount each input feature has contributed to a given prediction.


3 Kernel Density Estimation 


Kernel density estimation (KDE) [35] is one of the most common methods for
unsupervised learning. The KDE model (or variations of it) has been used, in
particular, for anomaly detection [21,26,38]. It assumes an unlabeled dataset
$\mathcal{D} = (u_1, \dots, u_N)$, and a kernel, typically the Gaussian kernel
$K(x, x') = \exp(-\gamma \|x - x'\|^2)$. The KDE model predicts a new data point x by computing:

$$p(x) = \frac{1}{N} \sum_{k=1}^{N} \exp(-\gamma \|x - u_k\|^2). \qquad (1)$$

The function p(x) can be interpreted as an (unnormalized) probability density
function. From this score, one can predict inlierness or outlierness of a data point.
For example, one can say that x is more anomalous than x' if the inequality
p(x) < p(x') holds. In the following, we consider the task of neuralizing the KDE
model so that its inlier/outlier predictions can be explained.


3.1 Explaining Outlierness 


A first question to ask is why a particular example x is predicted by KDE to be an
outlier, more specifically, what features of this example contribute to outlierness.
As a first step, we consider what is a suitable measure of outlierness. The function
p(x) produced by KDE decreases with outlierness, and also saturates to zero even
though outlierness continues to grow. A better measure of outlierness is given
by [19]:

$$o(x) \triangleq -\frac{1}{\gamma} \log p(x).$$

Unlike the function p(x), the function o(x) increases as the probability decreases.
It also does not saturate as x becomes more distant from the dataset. We now
focus on neuralizing the outlier score o(x). We find that o(x) can be expressed
as the two-layer neural network:

$$h_k = \|x - u_k\|^2$$

$$o(x) = \mathrm{LME}_k^{-\gamma}\{h_k\}$$

where $\mathrm{LME}_k^{\alpha}\{h_k\} = \frac{1}{\alpha} \log\big(\frac{1}{N} \sum_{k=1}^{N} \exp(\alpha h_k)\big)$ is a generalized log-mean-exp
pooling. The first layer computes the square distance of the new example from
each point in the dataset. The second layer can be interpreted as a soft min-pooling.
The structure of the outlier computation is shown for a one-dimensional
toy example in Fig. 2.
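
The equivalence between the KDE-based outlier score and the two-layer network can be checked numerically. The sketch below is a plain-Python illustration with hypothetical function names; the log-mean-exp is computed in a shifted form for numerical stability, which does not change its value.

```python
import math


def squared_distances(x, data):
    """Layer 1: squared Euclidean distance from x to every data point."""
    return [sum((xi - ui) ** 2 for xi, ui in zip(x, u)) for u in data]


def log_mean_exp(h, alpha):
    """Generalized log-mean-exp pooling: (1/alpha) * log(mean(exp(alpha * h_k)))."""
    shift = max(alpha * hk for hk in h)  # subtracted inside the exponentials for stability
    mean = sum(math.exp(alpha * hk - shift) for hk in h) / len(h)
    return (shift + math.log(mean)) / alpha


def kde_score(x, data, gamma):
    """Unnormalized Gaussian KDE score of x."""
    return sum(math.exp(-gamma * hk) for hk in squared_distances(x, data)) / len(data)


def outlier_score(x, data, gamma):
    """Layer 2: soft min-pooling over the squared distances."""
    return log_mean_exp(squared_distances(x, data), -gamma)
```

For any data point, `outlier_score` agrees with minus the logarithm of `kde_score` divided by gamma. It also lies between the smallest squared distance and that distance plus log(N)/gamma, which is the sense in which the pooling acts as a soft minimum.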


[Figure: diagram of the neuralized outlier computation on a one-dimensional toy example; recoverable label: outlierness.]

Fig. 2. Neuralized view of kernel density estimation for outlier prediction. The outlier
function can be represented as a soft min-pooling over square distances. These distances
also provide directionality in input space.


This structure is particularly amenable to explanation. In particular,
redistribution of o(x) in the intermediate layer can be achieved by a soft argmin
operation, e.g.

$$R_k = \frac{\exp(-\beta h_k)}{\sum_{k} \exp(-\beta h_k)} \cdot o(x),$$

where $\beta$ is a hyperparameter to be selected. Then, propagation on the input
features can leverage the geometry of the distance function, by computing

$$R_i = \sum_k \frac{(x_i - u_{ik})^2}{\epsilon + \|x - u_k\|^2} \, R_k.$$

The hyperparameter $\epsilon$ in the denominator is a stabilization term that ‘dissipates’
some of the relevance when x and $u_k$ coincide.
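
The two redistribution steps can be sketched as follows. The sketch assumes the rules as read here: a soft argmin over squared distances in the pooling layer, and, in the distance layer, a share for each feature proportional to its squared difference to the data point, with a small stabilizer in the denominator. Function names and default values are illustrative.

```python
import math


def explain_outlierness(x, data, gamma, beta, eps=1e-6):
    """Return the outlier score of x and its attribution to the input features."""
    # Layer 1: squared distances to every data point.
    h = [sum((xi - ui) ** 2 for xi, ui in zip(x, u)) for u in data]
    h_min = min(h)
    # Layer 2: soft min-pooling (log-mean-exp with parameter -gamma), in shifted form.
    o = h_min - math.log(sum(math.exp(-gamma * (hk - h_min)) for hk in h) / len(h)) / gamma
    # Step 1: soft argmin redistributes o onto the data points.
    weights = [math.exp(-beta * (hk - h_min)) for hk in h]
    norm = sum(weights)
    r_k = [w / norm * o for w in weights]
    # Step 2: each data point passes its relevance to features in proportion
    # to the squared difference along that feature.
    r_i = [sum((x[i] - u[i]) ** 2 / (eps + hk) * rk for u, hk, rk in zip(data, h, r_k))
           for i in range(len(x))]
    return o, r_i
```

When x does not coincide with a data point and the stabilizer is small, the feature scores sum to the outlier score up to a negligible dissipated amount.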

Referring back to Sect. 2.1, we want to stress that computing the relevance
of input features with LRP has the same computational complexity as a single
forward pass, and does not require training an explainable surrogate model.


3.2 Explaining Inlierness: Direct Approach 


In Sect. 3.1, we have focused on explaining what makes a given example an outlier.
An equally important question to ask is why a given example x is predicted
by the KDE model to be an inlier. Inlierness is naturally modeled by the KDE
output p(x). Hence we can define the measure of inlierness as $i(x) = p(x)$. An
inspection of Eq. (1) suggests the following two-layer neural network:

$$h_k = \exp(-\gamma \|x - u_k\|^2) \qquad \text{(layer 1)}$$

$$i(x) = \frac{1}{N} \sum_{k=1}^{N} h_k \qquad \text{(layer 2)}$$


The first layer performs a mapping on Gaussian functions at different locations,
and the second layer performs an average pooling. We now consider the task of
propagation. A natural way of redistributing in the top layer is in proportion to
the activations. This gives us the scores

$$R_k = \frac{h_k}{\sum_k h_k} \, i(x).$$

A decomposition of $R_k$ on the input features is however difficult. Rewriting the
relevance $R_k$ as a product,

$$R_k = \frac{1}{N} \prod_{i=1}^{d} \exp(-\gamma (x_i - u_{ik})^2),$$

and observing that the contribution $R_k$ can be made nearly zero by perturbing
any of the input features significantly, we can conclude that every input feature
contributes equally to $R_k$ and should therefore be attributed an equal share of
it. Application of this strategy for every neuron k would result in a uniform
redistribution of the score i(x) to the input features. The explanation would
therefore be qualitatively always the same, regardless of the data point x and
the overall shape of the inlier function i(x). While uniform attribution may be
a good baseline, we usually strive for a more informative explanation.


3.3 Explaining Inlierness: Random Features Approach 


To overcome the limitations of the approach above, we explore a second approach
to explaining inlierness, where the neuralization is based on a feature map
representation of the KDE model. For this, we first recall that any kernel-based
model also admits a formulation in terms of the feature map $\Phi(x)$ associated to
the kernel, i.e. $K(x, x') = \langle \Phi(x), \Phi(x') \rangle$. In particular Eq. (1) can be equivalently
rewritten as:


$$p(x) = \Big\langle \Phi(x), \frac{1}{N} \sum_{k=1}^{N} \Phi(u_k) \Big\rangle \qquad (2)$$


i.e. the product in feature space of the current example and the dataset mean. 
Recall that there is no explicit finite-dimensional feature map associated
to the Gaussian kernel. However, such a feature map can be approximated
using the framework of random features [36]. In particular, for a Gaussian kernel, 
features can be sampled as 


$$\Phi(x) = \sqrt{\tfrac{2}{H}} \, \big( \cos(w_j^\top x + b_j) \big)_{j=1}^{H} \qquad (3)$$


with $w_j \sim \mathcal{N}(\mu, \sigma^2 I)$ and $b_j \sim \mathcal{U}(0, 2\pi)$, and where the mean and scale parameters
of the Gaussian are $\mu = 0$ and $\sigma = \sqrt{2\gamma}$. The dot product $\langle \Phi(x), \Phi(x') \rangle$
converges to the Gaussian kernel as more and more features are being drawn.
In practice, we settle for a fixed number H of features. Injecting the random 
features in Eq. (2) yields the two-layer architecture: 


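$$h_j = \sqrt{2} \cos(w_j^\top x + b_j) \cdot \mu_j \quad \text{(layer 1)}$$
$$\tilde{i}(x) = \frac{1}{H} \sum_{j=1}^{H} h_j \quad \text{(layer 2)}$$

where $\mu_j = \frac{1}{N} \sum_{k=1}^{N} \sqrt{2} \cos(w_j^\top u_k + b_j)$ and with $(w_j, b_j)_j$ drawn from the distribution
given above. This architecture produces at its output an approximation $\tilde{i}(x)$ of
the true inlierness score $i(x)$ which becomes increasingly accurate as $H$ becomes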
large. Here, the first layer is a detection layer with a cosine nonlinearity, and 
the second layer performs average pooling. The structure of the neural network 
computation is illustrated on our one-dimensional example in Fig. 3. 
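The random features architecture can be checked numerically with a short sketch. The code below is an illustration under stated assumptions rather than the authors' implementation: function names, the toy data, and the choice H = 20000 are hypothetical, and features are drawn with standard deviation sqrt(2 * gamma) as stated above.

```python
import math
import random

def kde(x, data, gamma):
    """Exact Gaussian KDE score: average of exp(-gamma * ||x - u_k||^2)."""
    return sum(math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, u)))
               for u in data) / len(data)

def draw_features(d, H, gamma, rng):
    """Sample w_j ~ N(0, 2*gamma*I) and b_j ~ U(0, 2*pi) for j = 1..H."""
    sigma = math.sqrt(2.0 * gamma)
    return [([rng.gauss(0.0, sigma) for _ in range(d)],
             rng.uniform(0.0, 2.0 * math.pi)) for _ in range(H)]

def phi_j(x, w, b):
    """One random feature, without the 1/sqrt(H) scaling."""
    return math.sqrt(2.0) * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)

def rf_inlierness(x, data, features):
    """Layer 1: h_j = phi_j(x) * mu_j; layer 2: average pooling over j."""
    h = []
    for w, b in features:
        mu_j = sum(phi_j(u, w, b) for u in data) / len(data)
        h.append(phi_j(x, w, b) * mu_j)
    return h, sum(h) / len(h)

rng = random.Random(0)
data = [(0.0, 0.0), (1.0, 0.5), (-0.5, 1.0), (0.3, -0.2)]  # toy data (hypothetical)
x = (0.2, 0.4)
features = draw_features(d=2, H=20000, gamma=1.0, rng=rng)
h, approx = rf_inlierness(x, data, features)
exact = kde(x, data, gamma=1.0)
print(f"exact={exact:.4f}  approx={approx:.4f}")
```

The gap between `approx` and `exact` shrinks roughly like 1/sqrt(H), so a smaller H trades accuracy for speed.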




Fig. 3. Kernel density estimation approximated with random features (four of them 
are depicted in the figure). Unlike the Gaussian kernel, random features have a clear 
directionality in input space, thereby enabling a feature-wise explanation. 


This structure of the inlierness computation is more amenable to explanation. 
In the top layer, the pooling operation can be attributed based on the summands. 
In other words, we can apply

$$R_j = \frac{h_j}{\sum_{j'} h_{j'}} \, \tilde{i}(x)$$

for the first step of redistribution of $\tilde{i}(x)$. More importantly, in the first layer,
the random features have now a clear directionality (given by the vectors $(w_j)_j$),
which we can use for attribution on the input features. In particular, we can apply
the propagation rule:

$$R_i = \sum_j \frac{w_{ij}^2}{\|w_j\|^2} R_j.$$


Compared to the direct approach of Sect. 3.2, the explanation produced here
assigns different scores to each input feature. Moreover, while the estimate of
inlierness $\tilde{i}(x)$ converges to the true KDE inlierness score $i(x)$ as more random
features are being drawn, we observe similar convergence for the explanation
associated to the inlier prediction.
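The two redistribution steps can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes that each neuron receives R_j = h_j / H and that R_j is shared among input features in proportion to the squared weights w_ij^2 / ||w_j||^2, which is our reading of the propagation rule. Names and toy data are hypothetical.

```python
import math
import random

def explain_inlierness(x, data, gamma, H, seed=0):
    """Return (relevance per input feature, approximate inlierness score)."""
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)
    d = len(x)
    relevance = [0.0] * d
    score = 0.0

    def phi(z, w, b):
        return math.sqrt(2.0) * math.cos(sum(wi * zi for wi, zi in zip(w, z)) + b)

    for _ in range(H):
        w = [rng.gauss(0.0, sigma) for _ in range(d)]
        b = rng.uniform(0.0, 2.0 * math.pi)
        mu_j = sum(phi(u, w, b) for u in data) / len(data)
        r_j = phi(x, w, b) * mu_j / H         # top layer: R_j = h_j / H
        score += r_j
        norm2 = sum(wi * wi for wi in w)
        for i in range(d):                    # first layer: share w_ij^2 / ||w_j||^2
            relevance[i] += (w[i] ** 2 / norm2) * r_j
    return relevance, score

data = [(0.0, 0.0), (1.0, 0.5), (-0.5, 1.0), (0.3, -0.2)]  # toy data (hypothetical)
x = (0.2, 3.0)
relevance, score = explain_inlierness(x, data, gamma=1.0, H=2000)
print(score, relevance)
```

Because the shares w_ij^2 / ||w_j||^2 sum to one over i, the feature-wise scores add up to the approximate inlierness score.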


4 K-Means Clustering 


Another important class of unsupervised models is clustering. K-means is a popular
algorithm for identifying clusters in the data. The k-means model represents
each cluster $c$ with a centroid $\mu_c \in \mathbb{R}^d$ corresponding to the mean of the cluster
members. It assigns data onto clusters by first computing the distance between
the data point and each cluster, e.g.

$$d_c(x) = \|x - \mu_c\| \qquad (4)$$

and chooses the cluster with the lowest distance $d_c(x)$. Once the data has been
clustered, it is often the case that we would like to gain understanding of why a 
given data point has been assigned to a particular cluster, either for validating 
a given clustering model or for getting novel insights on the cluster structure of 
the data. 


4.1 Explaining Cluster Assignments 


As a starting point for applying our explanation framework, we need to identify
a function $f_c(x)$ that represents well the assignment onto a particular cluster $c$,
e.g. a function that is larger than zero when the data point is assigned to a given
cluster, and less than zero otherwise.

The distance function $d_c(x)$ on which the clustering algorithm is based is however
not directly suitable for the purpose of explanation. Indeed, $d_c(x)$ tends to
be inversely related to cluster membership, and it also does not take into account
how far the data point is from other clusters. In [18], it is proposed to contrast
the assigned cluster with the competing clusters. In particular, k-means cluster
membership can be modeled as the difference of (squared) distances between the
nearest competing cluster and the assigned cluster $c$:

$$f_c(x) = \min_{k \neq c} \{ d_k^2(x) \} - d_c^2(x) \qquad (5)$$

The paper [18] shows that this contrastive strategy results in a two-layer neural
network. In particular, Eq. (5) can be rewritten as the two-layer neural network:


$$h_k = w_k^\top x + b_k \quad \text{(layer 1)}$$
$$f_c = \min_{k \neq c} \{ h_k \} \quad \text{(layer 2)}$$

where $w_k = 2(\mu_c - \mu_k)$ and $b_k = \|\mu_k\|^2 - \|\mu_c\|^2$. The first layer is a linear layer
that depends on the centroid locations and provides a clear directionality in input
space. The second layer is a hard min-pooling. Once the neural network structure
of cluster membership has been extracted, we can proceed with explanation
techniques such as LRP by first reverse-propagating cluster evidence in the top
layer (contrasting the given cluster with all cluster competitors) and then further
propagating in the layer below. In particular, we first apply the soft argmin
redistribution

$$R_k = \frac{\exp(-\beta h_k)}{\sum_{k' \neq c} \exp(-\beta h_{k'})} \, f_c(x)$$

where $\beta$ is a hyperparameter to be selected. An advantage of the soft argmin
over its hard counterpart is that this does not create an abrupt transition 
between nearest competing clusters, which would in turn cause nearly identical 
data points with the same cluster decision to result in a substantially different 
explanation. Finally, the last step of redistribution on the input features can be 
achieved by leveraging the orientation of linear functions in the first layer, and 
applying the redistribution rule: 


Rk 


Overall, these two redistribution steps provide us with a way of meaningfully 
attributing the cluster evidence onto the input features. 
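The neuralization of k-means and the first redistribution step can be sketched as below. The code is a minimal illustration with hypothetical names and toy centroids. It assumes the weights w_k = 2(mu_c - mu_k) and biases b_k = ||mu_k||^2 - ||mu_c||^2 obtained by expanding the difference of squared distances, and it covers only the soft argmin step in the top layer, not the final redistribution onto input features.

```python
import math

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def neuralized_kmeans(x, centroids, c):
    """Layer 1: h_k = w_k.x + b_k for each competitor k != c; layer 2: f_c = min_k h_k."""
    mu_c = centroids[c]
    h = {}
    for k, mu_k in enumerate(centroids):
        if k == c:
            continue
        w = [2.0 * (mc - mk) for mc, mk in zip(mu_c, mu_k)]
        b = sum(v * v for v in mu_k) - sum(v * v for v in mu_c)
        h[k] = sum(wi * xi for wi, xi in zip(w, x)) + b
    return h, min(h.values())

def soft_argmin_relevance(h, f_c, beta):
    """Share f_c among competitors with weights exp(-beta * h_k), normalized over k != c."""
    h_min = min(h.values())
    weights = {k: math.exp(-beta * (hk - h_min)) for k, hk in h.items()}  # shifted for stability
    z = sum(weights.values())
    return {k: wk / z * f_c for k, wk in weights.items()}

centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]   # toy centroids (hypothetical)
x = (0.5, 0.2)
c = min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))
h, f_c = neuralized_kmeans(x, centroids, c)
contrast = min(sq_dist(x, centroids[k]) for k in range(len(centroids)) if k != c) \
    - sq_dist(x, centroids[c])
R = soft_argmin_relevance(h, f_c, beta=1.0)
print(c, f_c, contrast, R)
```

The value `f_c` computed through the linear layer matches the contrast of squared distances, and the scores in `R` add up to `f_c`. A larger `beta` concentrates relevance on the nearest competitor.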


5 Experiments 


We showcase the neuralization approaches presented above on two examples 
with two types of data: standard vector data representing wholesale customer 
spending behavior, and image data, more specifically, industrial inspection and 
scene images. 


5.1 Wholesale Customer Analysis 


Our first use case is the analysis of a wholesale customer dataset [11]. The 
dataset consists of 440 instances representing different customers, and for each 
instance, the annual consumption of the customer in monetary units (m.u.) for
the categories ‘fresh’, ‘milk’, ‘grocery’, ‘frozen’, ‘detergents/paper’, ‘delicatessen’
is given. Two additional geographic features are also part of this dataset, however 
we do not include them in our experiment. We will place our focus on two 
particular data points with feature values shown in the table below: 


Table 1. Excerpt of the Wholesale Customer Dataset [11] where we show feature
values, expressed in monetary units (m.u.), for two instances as well as the average
values over the whole dataset.

Index | Fresh      | Milk      | Grocery   | Frozen     | Detergents/Paper | Delicatessen
338   | 9351 m.u.  | 1347 m.u. | 2611 m.u. | 8170 m.u.  | 442 m.u.         | 868 m.u.
339   | 3 m.u.     | 333 m.u.  | 7201 m.u. | 15601 m.u. | 15 m.u.          | 550 m.u.
AVG   | 12000 m.u. | 5796 m.u. | 7951 m.u. | 3072 m.u.  | 2881 m.u.        | 1525 m.u.


Instance 338 has rather typical levels of spending across categories, in general 
slightly lower than average, but with high spending on frozen products. Instance 
339 has more extreme spending with almost no spending on fresh products and 
detergents and very high spending on frozen products. 

To get further insights into the data, we construct a KDE model on the
whole data and apply our analysis to the selected instances. Each input feature
is first mapped to the logarithm and standardized (mean 0 and variance 1). We
choose the kernel parameter $\gamma = 1$. We use a leave-one-out approach where the
data used to build the KDE model is the whole data except the instance to be
predicted and analyzed. The number of random features is set to $H = 2500$ such
that the computational complexity of the inlier model stays within one order of
magnitude of the original kernel model. Predictions on the whole dataset and
analysis for the selected instances are shown in Fig. 4.
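The preprocessing and leave-one-out scoring described above can be sketched as follows. This is an illustration only: the spending values are made up rather than taken from the dataset, the function names are hypothetical, and the exact Gaussian KDE is used instead of the random features approximation.

```python
import math

def log_standardize(rows):
    """Map each feature to its logarithm, then standardize columns to mean 0, variance 1."""
    logged = [[math.log(v) for v in row] for row in rows]
    cols = list(zip(*logged))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in logged]

def loo_inlierness(Z, idx, gamma=1.0):
    """KDE score of instance idx, computed from all other instances (leave-one-out)."""
    x = Z[idx]
    others = [z for j, z in enumerate(Z) if j != idx]
    return sum(math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, u)))
               for u in others) / len(others)

# Toy spending values (hypothetical); the last row is deliberately extreme
rows = [[120.0, 30.0], [100.0, 45.0], [90.0, 40.0], [110.0, 35.0], [5.0, 900.0]]
Z = log_standardize(rows)
scores = [loo_inlierness(Z, i) for i in range(len(Z))]
print(scores)
```

Leaving the instance out matters here: with it included, every point would receive a contribution of 1/N from its own kernel bump.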

Instance 338 is predicted to be an inlier, which is consistent with our initial 
observation that the levels of spending across categories are on the lower end but 
remain usual. We can characterize this instance as a typical small customer. We 
also note that the feature ‘frozen’ contributes less to inlierness according to our 
analysis, probably due to the spending on that category being unusually high 
for a typical small customer. 

Instance 339 has an inlierness score almost zero, which is consistent with the 
observation in Table 1 that spending behavior is extremal for multiple product 
categories. The decomposition of an inlierness score of almost zero on the dif- 
ferent categories is rather uninformative, hence, for this customer, we look at 
what explains outlierness (bottom of Fig. 4). We observe as expected that categories
where spending behavior diverges for this instance are indeed strongly
represented in the explanation of outlierness, with ‘fresh’, ‘milk’, ‘frozen’ and
‘detergents/paper’ contributing almost all evidence for outlierness. Surprisingly,
we observe that extremely low spending on ‘fresh’ is underrepresented in the
outlierness score, compared to other categories such as ‘milk’ or ‘frozen’ where
spending is less extreme. This apparent contradiction will be resolved by a cluster
analysis.

Fig. 4. Explanation of different predictions on the Wholesale Customers Dataset. The
dataset is represented on the left as a t-SNE plot (perplexity 100) and each data point
is color-coded according to its predicted inlierness and outlierness. On the right, explanation
of inlierness and outlierness in terms of input features for two selected instances.
Large bars in the plot correspond to strongly contributing features. For explanation
of inlierness, error bars are computed over 100 trials of newly drawn random features.
(Color figure online)

Using the same logarithmic mapping and standardization step as for the KDE 
model, we now train a k-means model on the data and set the number of clusters 
to 6. Training is repeated 10 times with different centroid initializations, and we 
retain the model that has reached the lowest k-means objective. The outcome 
of the clustering is shown in Fig. 5 (left). 
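The restart procedure can be sketched with a plain implementation of Lloyd's algorithm. This is a minimal sketch on hypothetical toy data; the function names, the fixed number of iterations, and the random initialization from data points are assumptions, not details given in the text.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_once(points, k, rng, n_iter=50):
    """One run of Lloyd's algorithm from a random initialization; returns (centroids, objective)."""
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    objective = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, objective

def kmeans_restarts(points, k, n_restarts=10, seed=0):
    """Repeat k-means with different initializations and keep the lowest objective."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1]), [obj for _, obj in runs]

# Toy data (hypothetical): two well-separated groups
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
(best_centroids, best_objective), objectives = kmeans_restarts(points, k=2)
print(best_objective, best_centroids)
```

In the experiment above, the same idea would be applied with 6 clusters and 10 restarts on the log-transformed, standardized features.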

We observe that Instance 338 falls somewhere at the border between the 
green and red clusters, whereas Instance 339 is well into the yellow cluster at the 
bottom. The decomposition of cluster evidence for these two instances is shown 
on the right. Because Instance 338 is at the border between two clusters, there 
is no evidence of membership to one or another cluster, and the decomposition 
of such (lack of) evidence results in an explanation that is zero for all categories. 
The decomposition of the cluster evidence for Instance 339, however, reveals 
that its cluster membership is mainly due to a singular spending pattern on the 
category ‘fresh’. To shed further light on this decision, we look at the cluster
to which this instance has been assigned, in particular, the average spending of
cluster members on each category. This information is shown in Table 2.

Fig. 5. On the left, a t-SNE representation of the Wholesale Customers Dataset, color-coded
by cluster membership according to our k-means model, and where opacity
represents evidence for the assigned cluster, i.e. how deep into its cluster the data
point is. On the right, explanation of cluster assignments for two selected instances.
(Color figure online)


Table 2. Average spending per category in the cluster to which Instance 339 has been
assigned.

Cluster | Fresh    | Milk      | Grocery   | Frozen    | Detergents/Paper | Delicatessen
Yellow  | 616 m.u. | 3176 m.u. | 6965 m.u. | 1523 m.u. | 1414 m.u.        | 135 m.u.


We observe that this cluster is characterized by low spending on fresh prod- 
ucts and delicatessen. It may be a cluster of small retailers that, unlike super- 
markets, do not have substantial refrigeration capacity. Hence, the very low level 
of spending of Instance 339 on ‘fresh’ products puts it well into that cluster, and 
it also explains why the outlierness of Instance 339 is not attributed to ‘fresh’ but 
to other features (cf. Fig. 4). In particular, what distinguishes Instance 339 from 
its cluster is a very high level of spending on frozen products, and this is also 
the category that contributes the most to outlierness of this instance according 
to our analysis of the KDE model. 

Traditionally, cluster membership has been characterized by more basic 
approaches such as population statistics of individual features (e.g. [8]). Figure 6 
shows such analysis for Instances 338 and 339 of the Wholesale Customer 
Dataset. Although similar observations to the ones above can be made from 
this simple statistical analysis, e.g. the feature ‘frozen’ appears to contradict
the membership of Instance 339 to Cluster 4, it is not clear from this simple
analysis what makes Instance 339 a member of Cluster 4 in the first place. For
example, while the feature ‘grocery’ of Instance 339 is within the interquartile
range (IQR) of Cluster 4 and can therefore be considered typical of that cluster,
other clusters have similar IQRs for that feature. Moreover, Instance 339 falls
significantly outside Cluster 4’s IQR for other features. In comparison, our LRP
approach more directly and reliably explains the cluster membership and outlierness
of the considered instances. Furthermore, population statistics of individual
features may be misleading on non-linear models (such as kernel clustering) and
do not scale to high-dimensional data, such as image data.

Fig. 6. Population statistics of individual features for the 6 clusters. The black cross
in Cluster 2 is Instance 338, the black cross in Cluster 4 is Instance 339. Features are
mapped to the logarithm and standardized.

Overall, our analysis allows us to identify, on a single-instance basis, features that
contribute to various properties relating this instance to the rest of the data, such 
as inlierness/outlierness and cluster membership. As our analysis has revealed, 
the insights that are obtained go well beyond a traditional data analysis based 
on looking at population statistics for individual features, or a simple inspection 
of unsupervised learning outcomes. 


5.2 Image Analysis 


Our next experiment looks at explanation of inlierness, outlierness, and cluster 
membership for image data. Unlike the example above, relevant image statistics 
are better expressed at a more abstract level than directly on the pixels. A 
popular approach consists of using a pretrained neural model (e.g. the VGG-16
network [46]), and using the activations produced at a certain layer as input.

We first consider the problem of anomaly detection for industrial inspection 
and use for this an image of the MVTec AD dataset [6], specifically, an image 
of wood where an anomalous horizontal scratch can be observed. The image is 
shown in Fig. 7 (left). We feed that image to a pretrained VGG-16 network and 
collect the activations at the output of Block 5 (i.e. at the output of the feature 
extractor). We consider each spatial location at the output of that block as a 
data point and build a KDE model (with $\gamma = 0.05$) on the resulting dataset.
We then apply our analysis to attribute the predicted inlierness/outlierness to
the activations of Block 5. In practice, we need to consider the fact that any 
attribution on a deactivated neuron cannot be redistributed further to input 
pixels as there is no pattern in pixel space to attach to. Hence, the propagation 
procedure must be carefully implemented to address this constraint, possibly by 
only redistributing a limited share of the model output. The details are given in 
Appendix A. As a last step, we take relevance scores computed at the output of 
Block 5 and pursue the relevance propagation procedure in the VGG-16 network 
using standard LRP rules until the pixels are reached. Explanations obtained for 
inlierness and outlierness of the wood image of interest are shown in Fig. 7. 


[Figure panels: input image, inlierness, outlierness]


Fig. 7. Exemplary image from the MVTec AD dataset along with the explanation 
of an inlier/outlier prediction of a KDE model built at the output of the VGG-16 
feature extractor. Red color indicates positively contributing pixels, blue color indicates 
negatively contributing pixels, and gray indicates irrelevant pixels. (Color figure online) 


It can be observed that pixels associated to regular wood stripes are the main
contributors to inlierness. In contrast, the horizontal scratch on the wood panel is
a contributing factor for outlierness. Hence, with our explanation method, we
can precisely identify, on a pixel-wise basis, the factors that contribute
for or against predicted inlierness and outlierness.

We now consider an image of the SUN 2010 database [52], an indoor scene
containing different pieces of furniture and home appliances. We consider the 
same VGG-16 network as in the experiment above and build a dataset by col- 
lecting activations at each spatial location of the output of Block 5. We then 
apply the k-means algorithm on this dataset with the number of clusters hard- 
coded to 5. Once the clustering model has been built, we rescale each cluster 
centroid to fixed norm. We then apply our analysis to attribute the cluster membership
scores to the activations at the output of Block 5. As for the industrial
inspection example above, we must adjust the LRP rules so that deactivated neu- 
rons are not attributed relevance. The details of the LRP procedure are given in 
Appendix A. Obtained relevance scores are then propagated further to the input 
pixels using standard LRP rules. Resulting explanations are shown in Fig. 8.

Fig. 8. Exemplary image and explanation of cluster assignments of a k-means model
built at the output of the VGG-16 feature extractor. Red, blue and gray indicate pos- 
itively contributing, negatively contributing, and irrelevant pixels respectively. (Color 
figure online) 


We observe that different clusters identify distinct concepts. For example, one 
cluster focuses on the microwave oven and the surrounding cupboards, a second 
cluster represents the bottom part of the bar chairs, a third cluster captures the 
kitchen’s background with a particular focus on a painting on the wall, the fourth 
cluster captures various objects on the table and in the background, and a last 
cluster focuses on the top-part of the chairs. While the clustering representation 
extracts distinct human-recognizable image features, it also shows some limits 
of the given representation, for example, the concept ‘bar chair’ is split in two 
distinct concepts (the bottom and top part of the chair respectively), whereas 
the clutter attached to Cluster 4 is not fully disentangled from the surrounding 
chairs and cupboards. 

Overall, our experiments on image data demonstrate that neuralization of 
unsupervised learning models can be naturally integrated with existing proce- 
dures for explaining deep neural networks. This enables an application of our 
method to a broad range of practical problems where unsupervised modeling is 
better tackled at a certain level of abstraction and not directly in input space.


6 Conclusion and Outlook


In this paper, we have considered the problem of explaining the predictions of 
unsupervised models, in particular, we have reviewed and extended the neu- 
ralization/propagation approach of [18,19] which consists of rewriting, without 
retraining, the unsupervised model as a functionally equivalent neural network, 
and applying LRP in a second step. On two models of interest, kernel density 
estimation and k-means, we have highlighted a variety of techniques that can be 
used for neuralization. This includes the identification of log-mean-exp pooling 
structures, the use of random features, and the transformation of a difference of 
(squared) distances into a linear layer. The capacity of our approach to deliver 
meaningful explanations was highlighted on two examples covering simple tab- 
ular data and images including their mapping on some layer of a convolutional 
network. 

While our approach delivers good quality explanations at low computational 
cost, there are however still a number of open questions that remain to be 
addressed to further solidify the neuralization-propagation approach, and the 
explanation of unsupervised models in general. 

A first question concerns the applicability of our method to a broader range 
of practical scenarios. We have highlighted how neuralized models can be built 
not only in input space but also on some layer of a deep neural network, thereby 
bringing explanations to much more complex unsupervised models. However, 
there is a higher diversity of unsupervised learning algorithms that are encoun- 
tered in practice, including energy-based models [16], spectral methods [33,44], 
linkage clustering [12], non-Euclidean methods [27], or prototype-based anomaly 
detection [14]. An important future work will therefore be to extend the pro- 
posed framework to handle this heterogeneity of unsupervised machine learning 
approaches. 

Another question is that of validation. There are many possible LRP prop- 
agation rules that one can define in practice, as well as potentially multiple 
neural network reformulations of the same unsupervised model. This creates a 
need for reliable techniques to evaluate the quality of different explanation meth- 
ods. While techniques to evaluate explanation quality have been proposed and 
successfully applied in the context of supervised learning (e.g. based on feature 
removal [39]), further care needs to be taken in the unsupervised scenario, in 
particular, to avoid that the outcome of the evaluation is spuriously affected 
by such feature removals. As an example, removing some feature responsible 
for some predicted anomaly may unintentionally cause some new artefact to be 
created in the data. That would in turn increase the anomaly score instead of 
lowering it as it was originally intended [19]. 

In addition to further extending and validating the neuralization-propagation 
approach, one needs to ask how to develop these explanation techniques beyond 
their usage as a simple visualization or data exploration tool. For example, it 
remains to demonstrate whether these explanation techniques, in combination 
with user feedback, can be used to systematically verify and improve the unsupervised
model at hand (e.g. as recently demonstrated for supervised models
[2,49]). Some initial steps have already been taken in this direction [20,38].


A Attribution on CNN Activations


Propagation rules mentioned in Sects. 3 and 4 are not suited for identifying
relevant neurons at some layer of a neural network when the goal is to propagate 
the relevance further down the layers of the neural network, e.g. to obtain a pixel- 
wise explanation. What we need to ensure in such scenario is that all relevant 
information is expressed in terms of activated neurons as they are the only ones 
for which the associated relevance can be grounded to a specific pattern in the 
pixel space. One possible approach is to decompose the relevance propagation 
into a propagating term and a non-propagating (or ‘dissipating’) one, which 
leads to a partial (although still useful) explanation. In the following, we describe 
the approaches we have taken to achieve our extension of explanations to deep 
models. 


A.1 Attributing Outlierness 
The activations in the first layer of the neuralized outlier model are

$$h_k = \|a - u_k\|^2$$

and the relevance that arrives on the corresponding neuron is given by
$R_k = p_k \, \mathrm{LME}^{-\gamma}_{k'}\{h_{k'}\}$ with $p_k = \frac{\exp(-\beta h_k)}{\sum_{k'} \exp(-\beta h_{k'})}$. Relevance associated to neuron $k$ can
be expressed as:

$$R_k = \underbrace{p_k \cdot a^\top (a - u_k)}_{R_k^{\text{dot}}} + \underbrace{p_k \cdot \big( u_k^\top (u_k - a) + \mathrm{LME}^{-\gamma}_{k'}\{h_{k'} - h_k\} \big)}_{R_k^{\text{res}}}$$

where we have used the commutativity of the LME function and the distributivity
of the squared norm to decompose the relevance in two terms, one that
can be meaningfully redistributed on the activations, and one that cannot be
redistributed. Redistribution in the first layer can then proceed as:

$$R_i = \sum_k \frac{a_i (a_i - u_{ik})}{\sum_{i'} a_{i'} (a_{i'} - u_{i'k})} R_k^{\text{dot}}$$

It is easy to demonstrate from this equation that any neuron with a; = 0 (i.e. 
deactivated) will not be attributed any relevance. 
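The propagating part of this rule can be sketched as below. This is a hedged illustration, not the authors' implementation: it assumes the propagating term is R_k^dot = p_k * a.(a - u_k) with soft argmin weights p_k, and it ignores the dissipated residual term. Names, prototypes, and activations are hypothetical.

```python
import math

def outlier_dot_relevance(a, prototypes, beta):
    """Redistribute the propagating term R_k^dot = p_k * a.(a - u_k) onto activations a_i."""
    h = [sum((ai - ui) ** 2 for ai, ui in zip(a, u)) for u in prototypes]
    h_min = min(h)
    e = [math.exp(-beta * (hk - h_min)) for hk in h]   # shifted for numerical stability
    z = sum(e)
    p = [ek / z for ek in e]
    r_dot, r_i = [], [0.0] * len(a)
    for pk, u in zip(p, prototypes):
        terms = [ai * (ai - ui) for ai, ui in zip(a, u)]
        denom = sum(terms)
        rk = pk * denom
        r_dot.append(rk)
        if denom != 0.0:                               # skip neurons with nothing to propagate
            for i, t in enumerate(terms):
                r_i[i] += t / denom * rk
    return r_dot, r_i

a = (0.0, 2.0, 1.0)   # first neuron deactivated (a_0 = 0), e.g. after a ReLU
prototypes = [(0.5, 1.0, 1.5), (0.2, 0.3, 0.1)]        # hypothetical u_k
r_dot, r_i = outlier_dot_relevance(a, prototypes, beta=1.0)
print(r_dot, r_i)
```

Since every numerator carries a factor a_i, the deactivated first neuron receives exactly zero relevance, and the remaining scores add up to the total propagating relevance.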


A.2 Attributing Inlierness 


Neurons in the first layer of the inlierness model based on random features have
activations given by:

$$h_j = \sqrt{2} \cos(w_j^\top a + b_j) \cdot \mu_j$$

and relevance scores $R_j = h_j / H$. Using a simple trigonometric identity, we can
rewrite the relevance scores in terms of unphased sine and cosine functions as:

$$R_j = \underbrace{-\sin(w_j^\top a) \sin(b_j) \cdot c_j}_{R_j^{\sin}} + \underbrace{\cos(w_j^\top a) \cos(b_j) \cdot c_j}_{R_j^{\cos}}$$

where $c_j = \sqrt{2} \mu_j / H$. We propose the redistribution rule:

$$R_i = \sum_j \frac{a_i w_{ij}}{\sum_{i'} a_{i'} w_{i'j}} R_j^{\sin} + \sum_j \frac{a_i w_{ij}}{\varepsilon_j + \sum_{i'} a_{i'} w_{i'j}} R_j^{\cos}$$

where $\varepsilon_j$ is a term set to be of same sign as the denominator, and that addresses
the case where a positive $R_j^{\cos}$ comes with a near-zero response $w_j^\top a$, by ‘dissipating’
some of the relevance $R_j^{\cos}$.
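The rewriting of the relevance scores follows from the angle-addition identity for the cosine. A short derivation, assuming that $c_j$ collects the constant factor $\sqrt{2}\mu_j/H$ (the factor $1/H$ coming from $R_j = h_j/H$):

```latex
\begin{align*}
R_j &= \frac{h_j}{H} = \frac{\sqrt{2}\,\mu_j}{H}\,\cos(w_j^\top a + b_j) \\
    &= c_j \left[ \cos(w_j^\top a)\cos(b_j) - \sin(w_j^\top a)\sin(b_j) \right] \\
    &= \underbrace{-\sin(w_j^\top a)\sin(b_j)\, c_j}_{R_j^{\sin}}
     + \underbrace{\cos(w_j^\top a)\cos(b_j)\, c_j}_{R_j^{\cos}}
\end{align*}
```

The sine term vanishes when the response $w_j^\top a$ is zero, whereas the cosine term does not, which is why only the cosine term needs a dissipation mechanism.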


A.3 Attributing Cluster Membership 


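The activation in the first layer of the neuralized cluster membership model is:

$$h_k = w_k^\top a + b_k$$

and the relevance score is given by $R_k = p_k \min_{k' \neq c}\{h_{k'}\}$ with
$p_k = \frac{\exp(-\beta h_k)}{\sum_{k' \neq c} \exp(-\beta h_{k'})}$. Similar to the outlier case, we decompose the relevance score
as:

$$R_k = \underbrace{p_k \cdot a^\top w_k}_{R_k^{\text{dot}}} + \underbrace{p_k \cdot \big( b_k + \min_{k' \neq c}\{h_{k'} - h_k\} \big)}_{R_k^{\text{res}}}$$

and only consider the first term for propagation. Specifically, we apply the propagation
rule:

$$R_i = \sum_{k \neq c} \frac{a_i w_{ik}}{\sum_{i'} a_{i'} w_{i'k}} R_k^{\text{dot}}$$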

where it can again be shown that only activated neurons are attributed relevance.


Abstract. Algorithmic recourse is concerned with aiding individuals
who are unfavorably treated by automated decision-making systems to 
overcome their hardship, by offering recommendations that would result 
in a more favorable prediction when acted upon. Such recourse actions 
are typically obtained through solving an optimization problem that min- 
imizes changes to the individual’s feature vector, subject to various plau- 
sibility, diversity, and sparsity constraints. Whereas previous works offer 
solutions to the optimization problem in a variety of settings, they crit- 
ically overlook real-world considerations pertaining to the environment 
in which recourse actions are performed. 

The present work emphasizes that changes to a subset of the individ- 
ual’s attributes may have consequential down-stream effects on other 
attributes, thus making recourse a fundamentally causal problem. Here, we
model such considerations using the framework of structural causal mod- 
els, and highlight pitfalls of not considering causal relations through 
examples and theory. Such insights allow us to reformulate the opti- 
mization problem to directly optimize for minimally-costly recourse over 
a space of feasible actions (in the form of causal interventions) rather 
than optimizing for minimally-distant “counterfactual explanations”. We 
offer both the optimization formulations and solutions to deterministic 
and probabilistic recourse, on an individualized and sub-population level, 
overcoming the steep assumptive requirements of offering recourse in 
general settings. Finally, using synthetic and semi-synthetic experiments 
based on the German Credit dataset, we demonstrate how such methods 
can be applied in practice under minimal causal assumptions.


1 Introduction


Predictive models are being increasingly used to support consequential decision- 
making in a number of contexts, e.g., denying a loan, rejecting a job appli- 
cant, or prescribing life-altering medication. As a result, there is mounting 
social and legal pressure [64,72] to provide explanations that help the affected 
individuals to understand “why a prediction was output”, as well as “how to 
act” to obtain a desired outcome. Answering these questions, for the different 
stakeholders involved, is one of the main goals of explainable machine learn- 
ing [15,19,32,37,42,53, 54]. 

In this context, several works have proposed to explain a model’s predictions 
of an affected individual using counterfactual explanations, which are defined as 
statements of “how the world would have (had) to be different for a desirable out- 
come to occur” [76]. Of specific importance are nearest counterfactual explana- 
tions, presented as the most similar instances to the feature vector describing the 
individual, that result in the desired prediction from the model [25,35]. A closely 
related term is algorithmic recourse—the actions required for, or “the system- 
atic process of reversing unfavorable decisions by algorithms and bureaucracies 
across a range of counterfactual scenarios” —which is argued as the underwriting 
factor for temporally extended agency and trust [70]. 

Counterfactual explanations have shown promise for practitioners and regu- 
lators to validate a model on metrics such as fairness and robustness [25, 58,69]. 
However, in their raw form, such explanations do not seem to fulfill one of the 
primary objectives of “explanations as a means to help a data-subject act rather 
than merely understand” [76]. 

The translation of counterfactual explanations to recourse actions, i.e., to a 
recommendable set of actions to help an individual achieve a favorable outcome, 
was first explored in [69], where additional feasibility constraints were imposed 
to support the concept of actionable features (e.g., to prevent asking the individ- 
ual to reduce their age or change their race). While a step in the right direction, 
this work and others that followed [25,41,49,58] implicitly assume that the set 
of actions resulting in the desired output would directly follow from the coun- 
terfactual explanation. This arises from the assumption that “what would have 
had to be in the past” (retrodiction) not only translates to “what should be in the 
future” (prediction) but also to “what should be done in the future” (recommenda- 
tion) [63]. We challenge this assumption and attribute the shortcoming of existing 
approaches to their lack of consideration for real-world properties, specifically the 
causal relationships governing the physical world in which actions are performed. 


1.1 Motivating Examples 


Example 1. Consider, for example, the setting in Fig. 1 where an individual has 
been denied a loan and seeks an explanation and recommendation on how to 
proceed. This individual has an annual salary (X,) of $75,000 and an account 
balance (X2) of $25,000 and the predictor grants a loan based on the binary 
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(ur) X= fi(U1) 
Xo := fo(X1,U2) M = (S, Pu) 
F) Pu = Pu, x Pv, 
OLI Ŷ = h(X1, X2) 


Fig. 1. Illustration of an example bivariate causal generative process, showing both 
the graphical model G (left), and the corresponding structural causal model (SCM) M 
(right) [45]. In this example, Xı represents an individual’s annual salary, X2 represents 
their bank balance, and Y denotes the output of a fixed deterministic predictor h, 
predicting an individual’s eligibility to receive a loan. U; and U2 denote unobserved 
(exogenous) random variables. 


output of h(X1, X2) = sgn(X1 + 5 · X2 − $225,000). Existing approaches may 
identify nearest counterfactual explanations as another individual with an annual 
salary of $100,000 (+33%) or a bank balance of $30,000 (+20%), therefore 
encouraging the individual to reapply when either of these conditions are met. 
On the other hand, assuming actions take place in a world where home-seekers 
save 30% of their salary, up to external fluctuations in circumstance (i.e., X2 := 
0.3 · X1 + U2), a salary increase of only +14% to $85,000 would automatically 
result in $3,000 additional savings, with a net positive effect on the loan-granting 
algorithm’s decision. 
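
The arithmetic of Example 1 can be checked with a short sketch. The function names are illustrative and not part of the chapter, amounts are kept as integers, and the 30% savings rule is the structural equation assumed in the example. How the predictor treats a score of exactly zero is not specified in the text, so the sketch reports only the argument of sgn(·).

```python
# Sketch of Example 1; names are illustrative, amounts are in dollars.

def score(salary, balance):
    """Argument of sgn(.) in the predictor h: X1 + 5 * X2 - 225,000."""
    return salary + 5 * balance - 225_000

def balance_after_salary_change(salary_f, balance_f, new_salary, rate_pct=30):
    """Balance implied by X2 := 0.3 * X1 + U2 when the salary is changed."""
    u2 = balance_f - rate_pct * salary_f // 100   # abduction: recover U2
    return rate_pct * new_salary // 100 + u2      # prediction with U2 held fixed

salary_f, balance_f = 75_000, 25_000
print(score(salary_f, balance_f))    # factual score (negative: loan denied)
print(score(100_000, balance_f))     # counterfactual explanation via salary
print(score(salary_f, 30_000))       # counterfactual explanation via balance

new_balance = balance_after_salary_change(salary_f, balance_f, 85_000)
print(new_balance, score(85_000, new_balance))   # smaller raise plus its causal effect
```

Under these assumptions the raise to $85,000 reaches the same score as the two costlier counterfactual explanations, because the additional $3,000 of savings follows from the salary change rather than requiring a separate action.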


Example 2. Consider now another instance of the setting of Fig.1 in which an 
agricultural team wishes to increase the yield of their rice paddy. While many 
factors influence yield (temperature, solar radiation, water supply, seed quality, 
...), assume that the primary actionable capacity of the team is their choice of 
paddy location. Importantly, the altitude (X1) at which the paddy sits has an 
effect on other variables. For example, the laws of physics may imply that a 
100m increase in elevation results in an average decrease of 1°C in temperature 
(X2). Therefore, it is conceivable that a counterfactual explanation suggesting 
an increase in elevation for optimal yield, without consideration for downstream 
effects of the elevation increase on other variables (e.g., a decrease in tempera- 
ture), may actually result in the prediction not changing. 


These two examples illustrate the pitfalls of generating recourse actions 
directly from counterfactual explanations without consideration for the (causal) 
structure of the world in which the actions will be performed. Actions derived 
directly from counterfactual explanations may ask too much effort from the indi- 
vidual (Example 1) or may not even result in the desired output (Example 2). 

We also remark that merely accounting for correlations between features 
(instead of modeling their causal relationships) would be insufficient as this 
would not align with the asymmetrical nature of causal interventions: for Example 1, 
increasing bank balance (X2) would not lead to a higher salary (X1), 
and for Example 2, increasing temperature (X2) would not affect altitude (X1), 
contrary to what would be predicted by a purely correlation-based approach. 
1.2 Summary of Contributions and Structure of This Chapter 


In the present work, we remedy this situation via a fundamental reformulation 
of the recourse problem: we rely on causal reasoning (Sect. 2.2) to incorporate 
knowledge of causal dependencies between features into the process of recom- 
mending recourse actions that, if acted upon, would result in a counterfactual 
instance that favorably changes the output of the predictive model (Sect. 2.1). 

First, we illuminate the intrinsic limitations of an approach in which recourse 
actions are directly derived from counterfactual explanations (Sect.3.1). We 
show that actions derived from pre-computed (nearest) counterfactual explana- 
tions may prove sub-optimal in the sense of higher-than-necessary cost, or, even 
worse, ineffective in the sense of not actually achieving recourse. To address these 
limitations, we emphasize that, from a causal perspective, actions correspond to 
interventions which not only model changes to the intervened-upon variables, 
but also downstream effects on the remaining (non-intervened-upon) variables. 
This insight leads us to propose a new framework of recourse through mini- 
mal interventions in an underlying structural causal model (SCM) (Sect. 3.2). 
We complement this formulation with a negative result showing that recourse 
guarantees are generally only possible if the true SCM is known (Sect. 3.3). 

Second, since real-world SCMs are rarely known, we focus on the problem 
of algorithmic recourse under imperfect causal knowledge (Sect. 4). We propose 
two probabilistic approaches which allow us to relax the strong assumption of a 
fully-specified SCM. In the first (Sect. 4.1), we assume that the true SCM, while 
unknown, is an additive Gaussian noise model [23,47]. We then use Gaussian 
processes (GPs) [79] to average predictions over a whole family of SCMs to 
obtain a distribution over counterfactual outcomes which forms the basis for 
individualised algorithmic recourse. In the second (Sect. 4.2), we consider a dif- 
ferent subpopulation-based (i.e., interventional rather than counterfactual) notion 
of recourse which allows us to further relax our assumptions by removing any 
assumptions on the form of the structural equations. This approach proceeds by 
estimating the effect of interventions on individuals similar to the one for which 
we aim to achieve recourse (i.e., the conditional average treatment effect [1]), and 
relies on conditional variational autoencoders [62] to estimate the interventional 
distribution. In both cases, we assume that the causal graph is known or can be 
postulated from expert knowledge, as without such an assumption causal rea- 
soning from observational data is not possible [48, Prop. 4.1]. To find minimum 
cost interventions that achieve recourse with a given probability, we propose a 
gradient-based approach to solve the resulting optimisation problems (Sect. 4.3). 

Our experiments (Sect. 5) on synthetic and semi-synthetic loan approval data 
show the need for probabilistic approaches to achieve algorithmic recourse in 
practice, as point estimates of the underlying true SCM often propose invalid 
recommendations or achieve recourse only at higher cost. Importantly, our results 
also suggest that subpopulation-based recourse is the right approach to adopt 
when assumptions such as additive noise do not hold. A user-friendly implemen- 
tation of all methods that only requires specification of the causal graph and a 
training set is available at https://github.com/amirhk/recourse. 
2 Preliminaries 


In this work, we consider algorithmic recourse through the lens of causality. We 
begin by reviewing the main concepts. 


2.1 XAI: Counterfactual Explanations and Algorithmic Recourse 


Let \(X = (X_1, \ldots, X_d)\) denote a tuple of random variables, or features, taking 
values \(x = (x_1, \ldots, x_d) \in \mathcal{X} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_d\). Assume that we are given a 
binary probabilistic classifier \(h : \mathcal{X} \to [0,1]\) trained to make decisions about 
i.i.d. samples from the data distribution \(P_X\).¹ 

For ease of illustration, we adopt the setting of loan approval as a running 
example, i.e., h(x) > 0.5 denotes that a loan is granted and h(x) < 0.5 that it is 
denied. For a given (“factual”) individual \(x^{\mathrm{F}}\) that was denied a loan, \(h(x^{\mathrm{F}}) < 0.5\), 
we aim to answer the following questions: “Why did individual \(x^{\mathrm{F}}\) not get the 
loan?” and “What would they have to change, preferably with minimal effort, 
to increase their chances for a future application?” 

A popular approach to this task is to find so-called (nearest) counterfactual 
explanations [76], where the term “counterfactual” is meant in the sense of the 
closest possible world with a different outcome [36]. Translating this idea to our 
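setting, a nearest counterfactual explanation \(x^{\mathrm{CFE}}\) for an individual \(x^{\mathrm{F}}\) is given 
by a solution to the following optimisation problem: 

\[ x^{\mathrm{CFE}} \in \operatorname*{argmin}_{x \in \mathcal{X}} \; \mathrm{dist}(x, x^{\mathrm{F}}) \quad \text{subject to} \quad h(x) > 0.5, \tag{1} \]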

where \(\mathrm{dist}(\cdot,\cdot)\) is a distance on \(\mathcal{X} \times \mathcal{X}\), and additional constraints may be added to 
reflect plausibility, feasibility, or diversity of the obtained counterfactual 
explanations [22,24,25,39,41,49,58]. Most existing approaches have focused on 
providing solutions to (1) by exploring semantically meaningful choices of \(\mathrm{dist}(\cdot,\cdot)\) 
for measuring similarity between individuals (e.g., \(\ell_0\), \(\ell_1\), \(\ell_\infty\), percentile-shift), 
accommodating different predictive models h (e.g., random forest, multilayer 
perceptron), and realistic plausibility constraints \(\mathcal{P} \subseteq \mathcal{X}\).² 

Although nearest counterfactual explanations provide an understanding of 
the most similar set of features that result in the desired prediction, they stop 
short of giving explicit recommendations on how to act to realize this set of 
features. The lack of specification of the actions required to realize \(x^{\mathrm{CFE}}\) from \(x^{\mathrm{F}}\) 
leads to uncertainty and limited agency for the individual seeking recourse. To 


1 Following the related literature, we consider a binary classification task by 
convention; most of our considerations extend to multi-class classification or 
regression settings as well, though. 
2 In particular, [14,41,76] solve (1) using gradient-based optimization; [55,69] employ 
mixed-integer linear program solvers to support mixed numeric/binary data; [49] use 
graph-based shortest path algorithms; [35] use a heuristic search procedure by 
growing spheres around the factual instance; [18,58] build on genetic algorithms for 
model-agnostic behavior; and [25] solve (1) using satisfiability solvers with 
closeness guarantees. For a more complete exposition, see the recent surveys [26,71]. 

shift the focus from explaining a decision to providing recommendable actions 
to achieve recourse, Ustun et al. [69] reformulated (1) as: 

\[ \delta^{*} \in \operatorname*{argmin}_{\delta \in \mathcal{F}} \; \mathrm{cost}^{\mathrm{F}}(\delta) \quad \text{subject to} \quad h(x^{\mathrm{F}} + \delta) > 0.5, \quad x^{\mathrm{F}} + \delta \in \mathcal{P}, \tag{2} \]

where \(\mathrm{cost}^{\mathrm{F}}(\cdot)\) is a user-specified cost function that encodes preferences between 
feasible actions from \(x^{\mathrm{F}}\), and \(\mathcal{F}\) and \(\mathcal{P}\) are optional sets of feasibility and 
plausibility constraints,³ restricting the actions and the resulting counterfactual 
explanation, respectively. The feasibility constraints in (2), as introduced in [69], aim 
at restricting the set of features that the individual may act upon. For instance, 
recommendations should not ask individuals to change their gender or reduce 
their age. Henceforth, we refer to the optimization problem in (2) as the CFE-based 
recourse problem, where the emphasis is shifted from minimising a distance as 
in (1) to optimising a personalised cost function \(\mathrm{cost}^{\mathrm{F}}(\cdot)\) over a set of actions \(\delta\) 
which individual \(x^{\mathrm{F}}\) can perform. 
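
To make the role of the feasibility set in (2) concrete, the following sketch solves a toy instance by exhaustive search. Everything in it is an assumption made for illustration: the logistic classifier `h`, the relative-change cost, the restriction to non-negative integer shifts (in thousands of dollars), and the `actionable` mask standing in for the feasibility set. It is not the mixed-integer approach of [69].

```python
import itertools
import math

def h(x):
    """Hypothetical logistic classifier; features in thousands of dollars."""
    return 1.0 / (1.0 + math.exp(-(x[0] + 5 * x[1] - 225) / 10))

def cost(delta, x_f):
    """Hypothetical cost: sum of relative changes |delta_i| / x_i^F."""
    return sum(abs(d) / xi for d, xi in zip(delta, x_f))

def cfe_based_recourse(x_f, actionable, max_shift=30):
    """Exhaustive search for (2) over integer shifts; returns None if infeasible.

    Features with actionable[i] == False are frozen (delta_i = 0), which plays
    the role of the feasibility set F in this toy setting.
    """
    grids = [range(max_shift + 1) if a else [0] for a in actionable]
    best = None
    for delta in itertools.product(*grids):
        x = [xi + di for xi, di in zip(x_f, delta)]
        if h(x) > 0.5 and (best is None or cost(delta, x_f) < cost(best, x_f)):
            best = delta
    return best

x_f = (75, 25)  # salary and balance of the factual individual
print(cfe_based_recourse(x_f, actionable=(True, True)))
print(cfe_based_recourse(x_f, actionable=(True, False)))
```

Note that the search treats the two features as independently manipulable, which is exactly the modelling choice questioned in the remainder of this section.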

The seemingly innocent reformulation of the counterfactual explanation prob- 
lem in (1) as a recourse problem in (2) is founded on two key assumptions. 


Assumption 1. The feature-wise difference between factual and nearest 
counterfactual instances, \(x^{\mathrm{CFE}} - x^{\mathrm{F}}\), directly translates to minimal action sets \(\delta^{*}\), such 
that performing the actions in \(\delta^{*}\) starting from \(x^{\mathrm{F}}\) will result in \(x^{\mathrm{CFE}}\). 


Assumption 2. There is a 1-1 mapping between \(\mathrm{dist}(\cdot, x^{\mathrm{F}})\) and \(\mathrm{cost}^{\mathrm{F}}(\cdot)\), 
whereby more effortful actions incur larger distance and higher cost. 


Unfortunately, these assumptions only hold in restrictive settings, rendering 
solutions of (2) sub-optimal or ineffective in many real-world scenarios. 
Specifically, Assumption 1 implies that features \(X_i\) for which \(\delta_i^{*} = 0\) are unaffected. 
However, this generally holds only if (i) the individual applies effort in a world 
where changing a variable does not have downstream effects on other variables 
(i.e., features are independent of each other); or (ii) the individual changes the 
value of a subset of variables while simultaneously enforcing that the values 
of all other variables remain unchanged (i.e., breaking dependencies between 
features). Beyond the sub-optimality that arises from assuming/reducing to an 
independent world in (i), and disregarding the feasibility of non-altering actions 
in (ii), non-altering actions may naturally incur a cost which is not captured 
in the current definition of cost, and hence Assumption 2 does not hold either. 
Therefore, except in trivial cases where the model designer actively inputs pair- 
wise independent features (independently manipulable inputs) to the classifier h 
(see Fig. 2a), generating recommendations from counterfactual explanations in 
this manner, i.e., ignoring the potentially rich causal structure over X and the 
resulting downstream effects that changes to some features may have on others 
(see Fig. 2b), warrants reconsideration. A number of authors have argued for the 
need to consider causal relations between variables when generating 
counterfactual explanations [25,39,41,69,76]; however, this has not yet been formalized. 


3 Here, “feasible” means possible to do, whereas “plausible” means possibly true, believ- 
able or realistic. Optimization terminology refers to both as feasibility sets. 
(a) Classifier-centric view (b) Causal graph G for M 


Fig. 2. A view commonly adopted for counterfactual explanations (a) treats features 
as independently manipulable inputs to a given fixed and deterministic classifier h. 
In the causal approach to algorithmic recourse taken in this work, we instead view 
variables as causally related to each other by a structural causal model (SCM) M with 
associated causal graph G (b). 


2.2 Causality: Structural Causal Models, Interventions, 
and Counterfactuals 


To reason formally about causal relations between features \(X = (X_1, \ldots, X_d)\), 
we adopt the structural causal model (SCM) framework [45].⁴ Specifically, we 
assume that the data-generating process of \(X\) is described by an (unknown) 
underlying SCM \(\mathcal{M}\) of the general form 

\[ \mathcal{M} = (S, P_U), \quad S = \{X_r := f_r(X_{\mathrm{pa}(r)}, U_r)\}_{r=1}^{d}, \quad P_U = P_{U_1} \times \ldots \times P_{U_d}, \tag{3} \]

where the structural equations \(S\) are a set of assignments generating each 
observed variable \(X_r\) as a deterministic function \(f_r\) of its causal parents \(X_{\mathrm{pa}(r)} \subseteq X \setminus X_r\) 
and an unobserved noise variable \(U_r\). The assumption of mutually 
independent noises (i.e., a fully factorised \(P_U\)) entails that there is no hidden 
confounding and is referred to as causal sufficiency. An SCM is often illustrated 
by its associated causal graph \(\mathcal{G}\), which is obtained by drawing a directed edge 
from each node in \(X_{\mathrm{pa}(r)}\) to \(X_r\) for \(r \in [d] := \{1, \ldots, d\}\), see Fig. 1 and Fig. 2b 
for examples. We assume throughout that \(\mathcal{G}\) is acyclic. In this case, \(\mathcal{M}\) implies 
a unique observational distribution \(P_X\), which factorises over \(\mathcal{G}\), defined as the 
push-forward of \(P_U\) via \(S\).⁵ 
Importantly, the SCM framework also entails interventional distributions 
describing a situation in which some variables are manipulated externally. E.g., 
using the do-operator, an intervention which fixes \(X_{\mathcal{I}}\) to \(\theta\) (where \(\mathcal{I} \subseteq [d]\)) is 
denoted by \(do(X_{\mathcal{I}} = \theta)\). The corresponding distribution of the remaining 
variables \(X_{-\mathcal{I}}\) can be computed by replacing the structural equations for \(X_{\mathcal{I}}\) in \(S\) 
to obtain the new set of equations \(S^{do(X_{\mathcal{I}} = \theta)}\). The interventional distribution 
\(P_{X_{-\mathcal{I}} \mid do(X_{\mathcal{I}} = \theta)}\) is then given by the observational distribution implied by the 
manipulated SCM \((S^{do(X_{\mathcal{I}} = \theta)}, P_U)\). 
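4 Also known as non-parametric structural equation model with independent errors. 
5 I.e., for \(r \in [d]\), \(P_{X_r \mid X_{\mathrm{pa}(r)}}(X_r \mid X_{\mathrm{pa}(r)}) := P_{U_r}(f_r^{-1}(X_r \mid X_{\mathrm{pa}(r)}))\), where 
\(f_r^{-1}(X_r \mid X_{\mathrm{pa}(r)})\) denotes the pre-image of \(X_r\) given \(X_{\mathrm{pa}(r)}\) under \(f_r\), i.e., 
\(f_r^{-1}(X_r \mid X_{\mathrm{pa}(r)}) := \{u \in \mathcal{U}_r : f_r(X_{\mathrm{pa}(r)}, u) = X_r\}\). 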
Similarly, an SCM also implies distributions over counterfactuals, i.e., 
statements about a world in which a hypothetical intervention was performed, all 
else being equal. For example, given observation \(x^{\mathrm{F}}\) we can ask what would have 
happened if \(X_{\mathcal{I}}\) had instead taken the value \(\theta\). We denote the counterfactual 
variable by \(X(do(X_{\mathcal{I}} = \theta)) \mid x^{\mathrm{F}}\), whose distribution can be computed in three 
steps [45]: 


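1. Abduction: compute the posterior distribution \(P_{U \mid x^{\mathrm{F}}}\) of the exogenous 
variables \(U\) given the factual observation \(x^{\mathrm{F}}\); 

2. Action: perform the intervention \(do(X_{\mathcal{I}} = \theta)\) by replacing the 
structural equations for \(X_{\mathcal{I}}\) by \(X_{\mathcal{I}} := \theta\) to obtain the new structural equations 
\(S^{do(X_{\mathcal{I}} = \theta)}\); 

3. Prediction: the counterfactual distribution \(P_{X(do(X_{\mathcal{I}} = \theta)) \mid x^{\mathrm{F}}}\) is the 
distribution induced by the resulting SCM \((S^{do(X_{\mathcal{I}} = \theta)}, P_{U \mid x^{\mathrm{F}}})\). 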


For instance, the counterfactual variable for individual \(x^{\mathrm{F}}\) had action \(a = 
do(X_{\mathcal{I}} = \theta) \in \mathcal{F}\) been performed would be \(X^{\mathrm{SCF}}(a) := X(a) \mid x^{\mathrm{F}}\). For a 
worked-out example of computing counterfactuals in SCMs, we refer to Sect. 3.2. 


3 Causal Recourse Formulation 


3.1 Limitations of CFE-Based Recourse 


Here, we use causal reasoning to formalize the limitations of the CFE-based 
recourse approach in (2). To this end, we first reinterpret the actions resulting 
from solving the CFE-based recourse problem, i.e., \(\delta^{*}\), as structural interventions 
by defining the set of indices \(\mathcal{I}\) of observed variables that are intervened upon. 


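Definition 1 (CFE-based actions). Given an individual \(x^{\mathrm{F}}\) in world \(\mathcal{M}\) and 
a solution \(\delta^{*}\) of (2), denote by \(\mathcal{I} = \{i \mid \delta_i^{*} \neq 0\}\) the set of indices of observed 
variables that are acted upon. A CFE-based action then refers to a set of 
structural interventions of the form \(a^{\mathrm{CFE}}(\delta^{*}, x^{\mathrm{F}}) := do(\{X_i := x_i^{\mathrm{F}} + \delta_i^{*}\}_{i \in \mathcal{I}})\). 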


Using Definition 1, we can derive the following key results that provide nec- 
essary and sufficient conditions for CFE-based actions to guarantee recourse. 


Proposition 1. A CFE-based action \(a^{\mathrm{CFE}}(\delta^{*}, x^{\mathrm{F}})\) in general (i.e., for arbitrary 
underlying causal models) results in the structural counterfactual \(x^{\mathrm{SCF}} = x^{\mathrm{CFE}} := 
x^{\mathrm{F}} + \delta^{*}\) and thus guarantees recourse (i.e., \(h(x^{\mathrm{SCF}}) \neq h(x^{\mathrm{F}})\)) if and only if the 
set of descendants of the acted upon variables determined by \(\mathcal{I}\) is the empty set. 


Corollary 1. If all features in the true world \(\mathcal{M}\) are mutually independent, (i.e., 
if they are all root-nodes in the causal graph), then CFE-based actions always 
guarantee recourse. 
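
The graphical condition in Proposition 1 is easy to check mechanically. The sketch below assumes the causal graph is given as a dictionary mapping each node to its children (a hypothetical representation, not one prescribed by the chapter) and tests whether the acted-upon variables have any descendants.

```python
def descendants(children, nodes):
    """Return all nodes reachable from `nodes` along directed edges."""
    seen, stack = set(), list(nodes)
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def has_no_descendants(children, acted_upon):
    """Condition of Proposition 1: the descendant set of the acted-upon variables is empty."""
    return not descendants(children, acted_upon)

# Graph of Fig. 1: salary X1 is a parent of bank balance X2.
fig1 = {"X1": ["X2"]}
print(has_no_descendants(fig1, {"X2"}))  # acting on the leaf X2 only
print(has_no_descendants(fig1, {"X1"}))  # acting on X1 changes X2 downstream

# Independent world of Corollary 1: no edges, so every action passes the check.
independent = {}
print(has_no_descendants(independent, {"X1", "X2"}))
```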
While the above results are formally proven in Appendix A of [28], we provide 
a sketch of the proof below. If the intervened-upon variables do not have 
descendants, then by definition \(x^{\mathrm{SCF}} = x^{\mathrm{CFE}}\). Otherwise, the value of the descendants 
will depend on the counterfactual value of their parents, leading to a structural 
counterfactual that does not resemble the nearest counterfactual explanation, 
\(x^{\mathrm{SCF}} \neq x^{\mathrm{CFE}}\), and thus may not result in recourse. Moreover, in an independent 
world the set of descendants of all the variables is by definition the empty set. 

Unfortunately, the independent world assumption is not realistic, as it 
requires all the features selected to train the predictive model h to be indepen- 
dent of each other. Moreover, limiting changes to only those variables without 
descendants may unnecessarily limit the agency of the individual, e.g., in Exam- 
ple 1, restricting the individual to only changing bank balance without e.g., pur- 
suing a new/side job to increase their income would be limiting. Thus, for a given 
non-independent M capturing the true causal dependencies between features, 
CFE-based actions require the individual seeking recourse to enforce (at least 
partially) an independent post-intervention model \(\mathcal{M}^{a^{\mathrm{CFE}}}\) (so that Assumption 1 
holds), by intervening on all the observed variables for which \(\delta_i^{*} \neq 0\) as well as 
on their descendants (even if their \(\delta_i^{*} = 0\)). However, such a requirement suffers 
from two main issues. First, it conflicts with Assumption 2, since holding the 
value of variables may still imply potentially infeasible and costly interventions 
in M to sever all the incoming edges to such variables, and even then it may 
be ineffective and not change the prediction (see Example 2). Second, as will be 
proven in the next section (see also, Example 1), CFE-based actions may still 
be suboptimal, as they do not benefit from the causal effect of actions towards 
changing the prediction. Thus, even when equipped with knowledge of causal 
dependencies, recommending actions directly from counterfactual explanations 
in the manner of existing approaches is not satisfactory. 


3.2 Recourse Through Minimal Interventions 


We have demonstrated that actions which immediately follow from counterfac- 
tual explanations may require unrealistic assumptions, or alternatively, result in 
sub-optimal or even infeasible recommendations. To address these limitations, we 
rewrite the recourse problem so that instead of finding the minimal (indepen- 
dent) shift of features as in (2), we seek the minimal cost set of actions (in the 
form of structural interventions) that results in a counterfactual instance yielding 
the favorable output from h. For simplicity, we present the formulation for the 
case of an invertible SCM (i.e., one with invertible structural equations S) such 
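that the ground-truth counterfactual \(x^{\mathrm{SCF}} = S^{a}(S^{-1}(x^{\mathrm{F}}))\) is a unique point. The 
resulting optimisation formulation is as follows: 

\[ a^{*} \in \operatorname*{argmin}_{a \in \mathcal{F}} \; \mathrm{cost}^{\mathrm{F}}(a) \quad \text{subject to} \quad h(x^{\mathrm{SCF}}(a)) > 0.5, \quad x^{\mathrm{SCF}}(a) = x(a) \mid x^{\mathrm{F}} \in \mathcal{P}, \tag{4} \]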
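[Figure content. Left: causal graph over X1, X2, X3, X4 and the predictor output Ŷ. 
Right: SCM M = (S, P_U) with structural equations X1 := U1, X2 := U2, 
X3 := f3(X1, X2) + U3, X4 := f4(X3) + U4, noise distribution 
P_U = P_U1 × P_U2 × P_U3 × P_U4, and predictor output Ŷ = h(X1, X2, X3, X4).] 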


Fig. 3. The structural causal model (graph and equations) for the working example 
and demonstration in Sect. 3.2. 


where \(a^{*} \in \mathcal{F}\) directly specifies the set of feasible actions to be performed for 
minimally costly recourse, with \(\mathrm{cost}^{\mathrm{F}}(\cdot)\).⁶ 

Importantly, using the formulation in (4) it is now straightforward to show 
the suboptimality of CFE-based actions (proof in Appendix A of [28]): 


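Proposition 2. Given an individual \(x^{\mathrm{F}}\) observed in world \(\mathcal{M}\), a set of feasible 
actions \(\mathcal{F}\), and a solution \(a^{*} \in \mathcal{F}\) of (4), assume that there exists a CFE-based 
action \(a^{\mathrm{CFE}}(\delta^{*}, x^{\mathrm{F}}) \in \mathcal{F}\) (see Definition 1) that achieves recourse, i.e., \(h(x^{\mathrm{F}}) \neq 
h(x^{\mathrm{CFE}})\). Then, \(\mathrm{cost}^{\mathrm{F}}(a^{*}) \leq \mathrm{cost}^{\mathrm{F}}(a^{\mathrm{CFE}})\). 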


Thus, for a known causal model capturing the dependencies among observed 
variables, and a family of feasible interventions, the optimization problem in (4) 
yields Recourse through Minimal Interventions (MINT). Generating minimal 
interventions through solving (4) requires that we be able to compute the 
structural counterfactual, \(x^{\mathrm{SCF}}\), of the individual \(x^{\mathrm{F}}\) in world \(\mathcal{M}\), given any feasible 
action \(a \in \mathcal{F}\). To this end, and for the purpose of demonstration, we consider a 
class of invertible SCMs, specifically, additive noise models (ANM) [23], where 
the structural equations \(S\) are of the form 

\[ S = \{X_r := f_r(X_{\mathrm{pa}(r)}) + U_r\}_{r=1}^{d} \;\implies\; u_r^{\mathrm{F}} = x_r^{\mathrm{F}} - f_r(x_{\mathrm{pa}(r)}^{\mathrm{F}}), \quad r \in [d], \tag{5} \]

and propose to use the three steps of structural counterfactuals in [45] to assign 
a single counterfactual \(x^{\mathrm{SCF}}(a) := x(a) \mid x^{\mathrm{F}}\) to each action \(a = do(X_{\mathcal{I}} = \theta) \in \mathcal{F}\) 
as below. 


Working Example. Consider the model in Fig. 3, where \(\{U_i\}_{i=1}^{4}\) are 
mutually independent exogenous variables, and \(\{f_i\}_{i=1}^{4}\) are deterministic (linear or 


6 We note that, although \(x^{*\mathrm{SCF}} := x(a^{*}) \mid x^{\mathrm{F}} = S^{a^{*}}(S^{-1}(x^{\mathrm{F}}))\) is a counterfactual 
instance, it does not need to correspond to the nearest counterfactual explanation, 
\(x^{*\mathrm{CFE}} := x^{\mathrm{F}} + \delta^{*}\), resulting from (2) (see, e.g., Example 1). This further emphasizes 
that minimal interventions are not necessarily obtainable via pre-computed 
nearest counterfactual instances, and recourse actions should be obtained by solving (4) 
rather than indirectly through the solution of (2). 
nonlinear) functions. Let \(x^{\mathrm{F}} = (x_1^{\mathrm{F}}, x_2^{\mathrm{F}}, x_3^{\mathrm{F}}, x_4^{\mathrm{F}})^{\top}\) be the observed features 
belonging to the (factual) individual seeking recourse. Also, let \(\mathcal{I}\) denote the set of 
indices corresponding to the subset of endogenous variables that are intervened 
upon according to the action set \(a\). Then, we obtain a structural 
counterfactual, \(x^{\mathrm{SCF}}(a) := x(a) \mid x^{\mathrm{F}} = S^{a}(S^{-1}(x^{\mathrm{F}}))\), by applying the 
Abduction-Action-Prediction steps [46] as follows: 

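Step 1. Abduction uniquely determines the value of all exogenous 
variables \(U\) given the observed evidence \(X = x^{\mathrm{F}}\): 

\[
\begin{aligned}
u_1 &= x_1^{\mathrm{F}}, \\
u_2 &= x_2^{\mathrm{F}}, \\
u_3 &= x_3^{\mathrm{F}} - f_3(x_1^{\mathrm{F}}, x_2^{\mathrm{F}}), \\
u_4 &= x_4^{\mathrm{F}} - f_4(x_3^{\mathrm{F}}).
\end{aligned}
\tag{6}
\]
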
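Step 2. Action modifies the SCM according to the hypothetical 
interventions, \(do(\{X_i := a_i\}_{i \in \mathcal{I}})\) (where \(a_i = x_i^{\mathrm{F}} + \delta_i\)), yielding \(S^{a}\): 

\[
\begin{aligned}
X_1 &:= [1 \in \mathcal{I}] \cdot a_1 + [1 \notin \mathcal{I}] \cdot U_1, \\
X_2 &:= [2 \in \mathcal{I}] \cdot a_2 + [2 \notin \mathcal{I}] \cdot U_2, \\
X_3 &:= [3 \in \mathcal{I}] \cdot a_3 + [3 \notin \mathcal{I}] \cdot (f_3(X_1, X_2) + U_3), \\
X_4 &:= [4 \in \mathcal{I}] \cdot a_4 + [4 \notin \mathcal{I}] \cdot (f_4(X_3) + U_4),
\end{aligned}
\tag{7}
\]

where \([\cdot]\) denotes the Iverson bracket. 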


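Step 3. Prediction recursively determines the values of all endogenous 
variables based on the computed exogenous variables \(\{u_i\}_{i=1}^{4}\) from Step 1 and 
\(S^{a}\) from Step 2, as: 

\[
\begin{aligned}
x_1^{\mathrm{SCF}} &:= [1 \in \mathcal{I}] \cdot a_1 + [1 \notin \mathcal{I}] \cdot u_1, \\
x_2^{\mathrm{SCF}} &:= [2 \in \mathcal{I}] \cdot a_2 + [2 \notin \mathcal{I}] \cdot u_2, \\
x_3^{\mathrm{SCF}} &:= [3 \in \mathcal{I}] \cdot a_3 + [3 \notin \mathcal{I}] \cdot (f_3(x_1^{\mathrm{SCF}}, x_2^{\mathrm{SCF}}) + u_3), \\
x_4^{\mathrm{SCF}} &:= [4 \in \mathcal{I}] \cdot a_4 + [4 \notin \mathcal{I}] \cdot (f_4(x_3^{\mathrm{SCF}}) + u_4).
\end{aligned}
\tag{8}
\]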
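
The three steps of the working example can be traced in code. The sketch below follows the graph of Fig. 3; the concrete functions `f3` and `f4` are hypothetical choices made only so that the example runs, and an action is represented as a dictionary from variable index to the value it is set to.

```python
def f3(x1, x2):
    """Hypothetical structural function of X3 (not from the chapter)."""
    return 0.5 * x1 + 2.0 * x2

def f4(x3):
    """Hypothetical structural function of X4 (not from the chapter)."""
    return x3 ** 2

def structural_counterfactual(x_f, action):
    """Abduction-Action-Prediction for the SCM of Fig. 3.

    x_f    : factual values (x1, x2, x3, x4)
    action : dict {index: value a_i} for the intervened-upon variables
    """
    x1, x2, x3, x4 = x_f
    # Step 1 (abduction): recover the noise values from the factual instance.
    u1, u2 = x1, x2
    u3 = x3 - f3(x1, x2)
    u4 = x4 - f4(x3)
    # Steps 2 and 3 (action and prediction): intervened-upon variables take the
    # value a_i, all others follow their structural equation with the abducted noise.
    c1 = action.get(1, u1)
    c2 = action.get(2, u2)
    c3 = action.get(3, f3(c1, c2) + u3)
    c4 = action.get(4, f4(c3) + u4)
    return (c1, c2, c3, c4)

x_f = (2.0, 1.0, 4.0, 15.0)
print(structural_counterfactual(x_f, {}))        # no action: factual instance is recovered
print(structural_counterfactual(x_f, {1: 4.0}))  # acting on X1 propagates to X3 and X4
print(structural_counterfactual(x_f, {3: 2.0}))  # acting on X3 leaves X1, X2 untouched
```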


General Assignment Formulation for ANMs. As we have not made any 
restricting assumptions about the structural equations (only that we operate 
with additive noise models⁷ where noise variables are pairwise independent), the 
solution for the working example naturally generalizes to SCMs corresponding 
to other DAGs with more variables. The assignment of structural counterfactual 
values can generally be written as: 


7 We remark that the presented formulation also holds for more general SCMs (for 
example where the exogenous variable contribution is not additive) as long as the 
sequence of structural equations \(S\) is invertible, i.e., there exists a sequence of 
equations \(S^{-1}\) such that \(x = S(S^{-1}(x))\) (in other words, the exogenous variables are 
uniquely identifiable via the abduction step). 
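\[ x_i^{\mathrm{SCF}} = [i \in \mathcal{I}] \cdot (x_i^{\mathrm{F}} + \delta_i) + [i \notin \mathcal{I}] \cdot \big(x_i^{\mathrm{F}} + f_i(\mathrm{pa}_i^{\mathrm{SCF}}) - f_i(\mathrm{pa}_i^{\mathrm{F}})\big). \tag{9} \]


In words, the counterfactual value of the i-th feature, \(x_i^{\mathrm{SCF}}\), takes the value \(x_i^{\mathrm{F}} + \delta_i\) 
if such feature is intervened upon (i.e., \(i \in \mathcal{I}\)). Otherwise, \(x_i^{\mathrm{SCF}}\) is computed as 
a function of both the factual and counterfactual values of its parents, denoted 
respectively by \(f_i(\mathrm{pa}_i^{\mathrm{F}})\) and \(f_i(\mathrm{pa}_i^{\mathrm{SCF}})\). The closed-form expression in (9) can 
replace the counterfactual constraint in (4), i.e., 

\[ x^{\mathrm{SCF}}(a) := x(a) \mid x^{\mathrm{F}} = S^{a}(S^{-1}(x^{\mathrm{F}})), \]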


after which the optimization problem may be solved by building on exist- 
ing frameworks for generating nearest counterfactual explanations, includ- 
ing gradient-based, evolutionary-based, heuristics-based, or verification-based 
approaches as referenced in Sect. 2.1. It is important to note that unlike 
CFE-based actions, where the precise values of all covariates post-intervention are 
specified, MINT-based actions require that the user focus only on the features upon 
which interventions are to be performed, which may better align with factors 
under the user's control (e.g., some features may be non-actionable but mutable 
through changes to other features; see also [6]). 
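
As a rough illustration of how (9) plugs into (4), the sketch below implements the assignment for an arbitrary additive noise model given in topological order, and then compares the smallest salary increase that flips a toy decision with and without downstream effects. The graph, the structural function (the 30% savings rule of Example 1, in thousands of dollars), the strict decision rule, and the one-dimensional search are all assumptions for illustration; they stand in for the solvers referenced in Sect. 2.1.

```python
def scf(x_f, delta, parents, f, order):
    """Structural counterfactual of an ANM via the closed form (9).

    x_f     : dict {i: factual value}
    delta   : dict {i: shift} for intervened-upon variables
    parents : dict {i: list of parent indices}
    f       : dict {i: structural function of the parents}, for non-root nodes
    order   : topological order of the variable indices
    """
    x_cf = {}
    for i in order:
        if i in delta:                       # intervened upon: x_i^F + delta_i
            x_cf[i] = x_f[i] + delta[i]
        else:                                # otherwise: shift by the change in f_i
            pa_cf = [x_cf[j] for j in parents[i]]
            pa_f = [x_f[j] for j in parents[i]]
            shift = f[i](*pa_cf) - f[i](*pa_f) if parents[i] else 0.0
            x_cf[i] = x_f[i] + shift
    return x_cf

# Toy version of Example 1 (thousands of dollars): X2 := 0.3 * X1 + U2.
parents = {1: [], 2: [1]}
f = {2: lambda x1: 3 * x1 / 10}
order = [1, 2]
x_f = {1: 75, 2: 25}

def favourable(x):
    """Assumed decision rule: strictly positive score."""
    return x[1] + 5 * x[2] - 225 > 0

def cheapest_salary_shift(use_causal_model, max_shift=100):
    """Smallest integer salary increase that yields a favourable decision."""
    for d1 in range(max_shift + 1):
        if use_causal_model:
            x = scf(x_f, {1: d1}, parents, f, order)
        else:                                # independent-features view
            x = {1: x_f[1] + d1, 2: x_f[2]}
        if favourable(x):
            return d1
    return None

print(cheapest_salary_shift(use_causal_model=True))
print(cheapest_salary_shift(use_causal_model=False))
```

In this toy setting the search that accounts for the downstream effect on the balance returns a smaller shift than the search that holds the balance fixed, in line with Proposition 2.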


3.3 Negative Result: No Recourse Guarantees for Unknown 
Structural Equations 


In practice, the structural counterfactual \(x^{\mathrm{SCF}}(a)\) can only be computed using 
an approximate (and likely imperfect) SCM \(\mathcal{M} = (S, P_U)\), which is estimated 
from data assuming a particular form of the structural equations as in (5). 
However, assumptions on the form of the true structural equations \(S_{\star}\) are generally 
untestable, not even with a randomized experiment, since there exist multiple 
SCMs which imply the same observational and interventional distributions, but 
entail different structural counterfactuals. 


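Example 3 (adapted from 6.19 in [48]). Consider the following two SCMs \(\mathcal{M}_A\) 
and \(\mathcal{M}_B\) which arise from the general form in Fig. 1 by choosing \(U_1, U_2 \sim 
\mathrm{Bernoulli}(0.5)\) and \(U_3 \sim \mathrm{Uniform}(\{0, \ldots, K\})\) independently in both \(\mathcal{M}_A\) and 
\(\mathcal{M}_B\), with structural equations 

\[
\begin{aligned}
X_1 &:= U_1, && \text{in } \{\mathcal{M}_A, \mathcal{M}_B\}, \\
X_2 &:= X_1 (1 - U_2), && \text{in } \{\mathcal{M}_A, \mathcal{M}_B\}, \\
X_3 &:= \mathbb{I}_{X_1 \neq X_2} \big(\mathbb{I}_{U_3 > 0} X_1 + \mathbb{I}_{U_3 = 0} X_2\big) + \mathbb{I}_{X_1 = X_2} U_3, && \text{in } \mathcal{M}_A, \\
X_3 &:= \mathbb{I}_{X_1 \neq X_2} \big(\mathbb{I}_{U_3 > 0} X_1 + \mathbb{I}_{U_3 = 0} X_2\big) + \mathbb{I}_{X_1 = X_2} (K - U_3), && \text{in } \mathcal{M}_B.
\end{aligned}
\]

Then \(\mathcal{M}_A\) and \(\mathcal{M}_B\) both imply exactly the same observational and interventional 
distributions, and thus are indistinguishable from empirical data. However, 
having observed \(x^{\mathrm{F}} = (1, 0, 0)\), they predict different counterfactuals had \(X_1\) been 
0, i.e., \(x^{\mathrm{SCF}}(X_1 = 0) = (0, 0, 0)\) and \((0, 0, K)\), respectively.⁸ 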
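
Because all noise variables in Example 3 are discrete and uniform, the claim can be verified by enumeration. The sketch below fixes K = 3 (an arbitrary choice), counts outcomes over all equally likely noise settings to compare observational and single-variable interventional distributions, and performs abduction by keeping the noise settings consistent with the factual observation. Function names are illustrative.

```python
from collections import Counter
from itertools import product

K = 3  # arbitrary choice for the support {0, ..., K} of U3

def x3_model_a(x1, x2, u3):
    return (x1 if u3 > 0 else x2) if x1 != x2 else u3

def x3_model_b(x1, x2, u3):
    return (x1 if u3 > 0 else x2) if x1 != x2 else K - u3

def simulate(x3_fn, u, do=None):
    """Evaluate the SCM for noise u = (u1, u2, u3) under an optional intervention."""
    do = do or {}
    u1, u2, u3 = u
    x1 = do.get(1, u1)
    x2 = do.get(2, x1 * (1 - u2))
    x3 = do.get(3, x3_fn(x1, x2, u3))
    return (x1, x2, x3)

# All noise settings are equally likely, so counts are proportional to probabilities.
NOISE = list(product([0, 1], [0, 1], range(K + 1)))

def distribution(x3_fn, do=None):
    return Counter(simulate(x3_fn, u, do) for u in NOISE)

def counterfactuals(x3_fn, x_f, do):
    """Abduction by enumeration, then action and prediction."""
    consistent = [u for u in NOISE if simulate(x3_fn, u) == x_f]
    return {simulate(x3_fn, u, do) for u in consistent}

print(distribution(x3_model_a) == distribution(x3_model_b))                  # observational
print(distribution(x3_model_a, {1: 0}) == distribution(x3_model_b, {1: 0}))  # interventional
print(counterfactuals(x3_model_a, (1, 0, 0), {1: 0}))
print(counterfactuals(x3_model_b, (1, 0, 0), {1: 0}))
```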


8 This follows from abduction on \(x^{\mathrm{F}} = (1, 0, 0)\) which for both \(\mathcal{M}_A\) and \(\mathcal{M}_B\) implies 
\(U_3 = 0\). 
Confirming or refuting an assumed form of \(S_{\star}\) would thus require 
counterfactual data which is, by definition, never available. Thus, Example 3 proves the 
following proposition by contradiction. 


Proposition 3 (Lack of Recourse Guarantees). If the set of descendants of 
intervened-upon variables is non-empty, algorithmic recourse can be guaranteed 
in general (i.e., without further restrictions on the underlying causal model) only 
if the true structural equations are known, irrespective of the amount and type 
of available data. 


Remark 1. The converse of Proposition 3 does not hold. E.g., given \(x^{\mathrm{F}} = (1, 0, 1)\) 
in Example 3, abduction in either model yields \(U_3 > 0\), so the counterfactual 
of \(X_3\) cannot be predicted exactly. 


Building on the framework of [28], we next present two novel approaches 
for causal algorithmic recourse under unknown structural equations. The first 
approach in Sect. 4.1 aims to estimate the counterfactual distribution under the 
assumption of ANMs (5) with Gaussian noise for the structural equations. The 
second approach in Sect. 4.2 makes no assumptions about the structural equa- 
tions, and instead of approximating the structural equations, it considers the 
effect of interventions on a sub-population similar to \(x^{\mathrm{F}}\). We recall that the 
causal graph is assumed to be known throughout. 


4 Recourse Under Imperfect Causal Knowledge 


4.1 Probabilistic Individualised Recourse 


Since the true SCM \(\mathcal{M}_{\star}\) is unknown, one approach to solving (4) is to learn an 
approximate SCM \(\mathcal{M}\) within a given model class from training data \(\{x^{i}\}_{i=1}^{n}\). 
For example, for an ANM (5) with zero-mean noise, the functions \(f_r\) can be 
learned via linear or kernel (ridge) regression of \(X_r\) given \(X_{\mathrm{pa}(r)}\) as input. We 
refer to these approaches as \(\mathcal{M}_{\mathrm{LIN}}\) and \(\mathcal{M}_{\mathrm{KR}}\), respectively. \(\mathcal{M}\) can then be used 
in place of \(\mathcal{M}_{\star}\) to infer the noise values as in (5), and subsequently to predict 
a single-point counterfactual \(x^{\mathrm{SCF}}(a)\) to be used in (4). However, the learned 
causal model \(\mathcal{M}\) may be imperfect, and thus lead to wrong counterfactuals due 
to, e.g., the finite sample of the observed data, or more importantly, due to 
model misspecification (i.e., assuming a wrong parametric form for the structural 
equations). 

To address this limitation, we adopt a Bayesian approach to account for the 
uncertainty in the estimation of the structural equations. Specifically, we assume 
additive Gaussian noise and rely on probabilistic regression using a Gaussian 
process (GP) prior over the functions \(f_r\); for an overview of regression with 
GPs, we refer to [79, § 2]. 


Definition 2 (GP-SCM). A Gaussian process SCM (GP-SCM) over X refers 
to the model 


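\[ X_r := f_r(X_{\mathrm{pa}(r)}) + U_r, \quad f_r \sim \mathcal{GP}(0, k_r), \quad U_r \sim \mathcal{N}(0, \sigma_r^{2}), \quad r \in [d], \tag{10} \]

with covariance functions \(k_r : \mathcal{X}_{\mathrm{pa}(r)} \times \mathcal{X}_{\mathrm{pa}(r)} \to \mathbb{R}\), e.g., RBF kernels for 
continuous \(X_{\mathrm{pa}(r)}\). 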


While GPs have previously been studied in a causal context for structure 
learning [16,73], estimating treatment effects [2,56], or learning SCMs with latent 
variables and measurement error [61], our goal here is to account for the 
uncertainty over \(f_r\) in the computation of the posterior over \(U_r\), and thus to obtain a 
counterfactual distribution, as summarised in the following propositions. 


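Proposition 4 (GP-SCM Noise Posterior). Let \(\{x^{i}\}_{i=1}^{n}\) be an 
observational sample from (10). For each \(r \in [d]\) with non-empty parent set \(|\mathrm{pa}(r)| > 0\), 
the posterior distribution of the noise vector \(\mathbf{u}_r = (u_r^{1}, \ldots, u_r^{n})\), conditioned on 
\(\mathbf{x}_r = (x_r^{1}, \ldots, x_r^{n})\) and \(\mathbf{X}_{\mathrm{pa}(r)} = (x_{\mathrm{pa}(r)}^{1}, \ldots, x_{\mathrm{pa}(r)}^{n})\), is given by 

\[ \mathbf{u}_r \mid \mathbf{X}_{\mathrm{pa}(r)}, \mathbf{x}_r \sim \mathcal{N}\big(\sigma_r^{2} (K + \sigma_r^{2} I)^{-1} \mathbf{x}_r, \; \sigma_r^{2} \big(I - \sigma_r^{2} (K + \sigma_r^{2} I)^{-1}\big)\big), \tag{11} \]

where \(K := \big(k_r(x_{\mathrm{pa}(r)}^{i}, x_{\mathrm{pa}(r)}^{j})\big)_{ij}\) denotes the Gram matrix. 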
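
A direct numerical transcription of (11) may help to see what the noise posterior involves. The sketch below is a minimal version under stated assumptions: an RBF kernel with a fixed lengthscale, a known noise variance, and no hyperparameter learning. The function names are illustrative and this is not the implementation from the repository mentioned in Sect. 1.2.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """RBF kernel matrix between the rows of A (n x p) and B (m x p)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

def noise_posterior(X_pa, x_r, noise_var, lengthscale=1.0):
    """Mean and covariance of the noise vector u_r as in (11).

    X_pa      : (n, p) array of parent values
    x_r       : (n,) array of observed values of X_r
    noise_var : assumed-known noise variance sigma_r^2
    """
    n = len(x_r)
    K = rbf_kernel(X_pa, X_pa, lengthscale)
    A = np.linalg.solve(K + noise_var * np.eye(n), np.eye(n))  # (K + sigma^2 I)^{-1}
    mean = noise_var * A @ x_r
    cov = noise_var * (np.eye(n) - noise_var * A)
    return mean, cov

rng = np.random.default_rng(0)
X_pa = rng.normal(size=(6, 2))   # synthetic parent values
x_r = rng.normal(size=6)         # synthetic child values
mean, cov = noise_posterior(X_pa, x_r, noise_var=0.1)
print(mean.shape, cov.shape)
```

The posterior mean equals the residual between the observations and the usual GP posterior mean of \(f_r\) at the training inputs, which offers a simple sanity check.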

Next, in order to compute counterfactual distributions, we rely on ancestral 
sampling (according to the causal graph) of the descendants of the intervention 
targets \(X_{\mathcal{I}}\) using the noise posterior of (11). The counterfactual distribution of 
each descendant \(X_r\) is given by the following proposition. 


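Proposition 5 (GP-SCM Counterfactual Distribution). Let $\{x^i\}_{i=1}^n$ be
an observational sample from (10). Then, for $r \in [d]$ with $|\mathrm{pa}(r)| > 0$, the
counterfactual distribution over $X_r$ had $X_{\mathrm{pa}(r)}$ been $\tilde{x}_{\mathrm{pa}(r)}$ (instead of $x^F_{\mathrm{pa}(r)}$) for
individual $x^F \in \{x^i\}_{i=1}^n$ is given by

$$X_r(X_{\mathrm{pa}(r)} = \tilde{x}_{\mathrm{pa}(r)}) \mid x^F, \{x^i\}_{i=1}^n \sim \mathcal{N}\big(\mu_r^F + \tilde{\mathbf{k}}^\top (K + \sigma_r^2 I)^{-1} x_r,\; s_r^F + \tilde{k} - \tilde{\mathbf{k}}^\top (K + \sigma_r^2 I)^{-1} \tilde{\mathbf{k}}\big), \qquad (12)$$

where $\tilde{k} := k_r(\tilde{x}_{\mathrm{pa}(r)}, \tilde{x}_{\mathrm{pa}(r)})$, $\tilde{\mathbf{k}} := \big(k_r(\tilde{x}_{\mathrm{pa}(r)}, x^1_{\mathrm{pa}(r)}), \ldots, k_r(\tilde{x}_{\mathrm{pa}(r)}, x^n_{\mathrm{pa}(r)})\big)$, $x_r$
and $K$ are as defined in Proposition 4, and $\mu_r^F$ and $s_r^F$ are the posterior mean and
variance of $u_r^F$ given by (11).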
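The following sketch evaluates the mean and variance in (12) for one node with NumPy. It is a minimal illustration under assumed choices (RBF kernel with unit lengthscale, a single parent, synthetic data, and hypothetical function names), not the implementation used in the experiments.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """RBF kernel matrix between the rows of a and b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def counterfactual_gaussian(x_pa, x_r, factual_idx, x_pa_new, noise_var):
    """Mean and variance of X_r(X_pa(r) = x_pa_new) | x^F, data, following (12)."""
    n = x_r.shape[0]
    K = rbf(x_pa, x_pa)
    A_inv = np.linalg.inv(K + noise_var * np.eye(n))   # (K + sigma_r^2 I)^{-1}
    # Noise posterior (11), restricted to the factual individual.
    mu_f = noise_var * (A_inv @ x_r)[factual_idx]
    s_f = noise_var * (1.0 - noise_var * A_inv[factual_idx, factual_idx])
    # Kernel terms at the hypothetical parent value.
    x_new = np.atleast_2d(x_pa_new)
    k_vec = rbf(x_new, x_pa)[0]
    k_scalar = rbf(x_new, x_new)[0, 0]
    mean = mu_f + k_vec @ A_inv @ x_r
    var = s_f + k_scalar - k_vec @ A_inv @ k_vec
    return mean, var

# Toy data (assumed): one parent, tanh mechanism, small Gaussian noise.
rng = np.random.default_rng(1)
x_pa = rng.normal(size=(30, 1))
x_r = np.tanh(x_pa[:, 0]) + 0.1 * rng.normal(size=30)
mean_same, var_same = counterfactual_gaussian(x_pa, x_r, 0, x_pa[0], noise_var=0.01)
mean_new, var_new = counterfactual_gaussian(x_pa, x_r, 0, np.array([1.5]), noise_var=0.01)
```

When the hypothetical parent value equals the factual one, the mean in (12) reduces to the factual observation $x_r^F$, which the first call can be used to verify.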


All proofs can be found in Appendix A of [27]. We can now generalise
the recourse problem (4) to our probabilistic setting by replacing the
single-point counterfactual $x^{SCF}(a)$ with the counterfactual random variable
$X^{SCF}(a) := X(a) \mid x^F$. As a consequence, it no longer makes sense to consider a hard
constraint of the form $h(x^{SCF}(a)) > 0.5$, i.e., that the prediction needs to change. Instead,
we can reason about the expected classifier output under the counterfactual
distribution, leading to the following probabilistic version of the individualised
recourse optimisation problem:


$$\min_{a = do(X_{\mathcal{I}} = \theta) \in \mathcal{F}} \mathrm{cost}^F(a) \quad \text{subject to} \quad \mathbb{E}_{X^{SCF}(a)}\big[h(X^{SCF}(a))\big] > \mathrm{thresh}(a). \qquad (13)$$


Fig. 4. Illustration of point- and subpopulation-based recourse approaches.


Note that the threshold thresh(a) is allowed to depend on a. For example, an
intuitive choice is

$$\mathrm{thresh}(a) = 0.5 + \gamma_{LCB} \sqrt{\mathrm{Var}_{X^{SCF}(a)}\big[h(X^{SCF}(a))\big]} \qquad (14)$$

which has the interpretation of the lower-confidence bound crossing the decision
boundary of 0.5. Note that larger values of the hyperparameter $\gamma_{LCB}$ lead to a
more conservative approach to recourse, while for $\gamma_{LCB} = 0$ merely crossing the
decision boundary with > 50% chance suffices.
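To make the role of $\gamma_{LCB}$ concrete, the sketch below checks the constraint of (13) with the threshold (14) from Monte Carlo classifier outputs. The logistic classifier and the Gaussian counterfactual samples are assumptions chosen purely for illustration.

```python
import math
import random
import statistics

def lcb_constraint(outputs, gamma_lcb):
    """Return (satisfied, mean, std) for E[h] > 0.5 + gamma_lcb * sqrt(Var[h])."""
    mean_h = statistics.fmean(outputs)
    std_h = statistics.pstdev(outputs)
    return mean_h > 0.5 + gamma_lcb * std_h, mean_h, std_h

rng = random.Random(0)
# Assumed setup: counterfactual samples X ~ N(1, 0.5^2) and a logistic classifier h.
samples = [rng.gauss(1.0, 0.5) for _ in range(2000)]
outputs = [1.0 / (1.0 + math.exp(-x)) for x in samples]

ok_lenient, mean_h, std_h = lcb_constraint(outputs, gamma_lcb=0.0)
ok_strict, _, _ = lcb_constraint(outputs, gamma_lcb=5.0)
print(ok_lenient, ok_strict)
```

In this toy setting the same action passes the check for $\gamma_{LCB} = 0$ but fails it for a large $\gamma_{LCB}$, reflecting the more conservative behaviour described above.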


4.2 Probabilistic Subpopulation-Based Recourse 


The GP-SCM approach in Sect. 4.1 allows us to average over an infinite number
of (non-)linear structural equations, under the assumption of additive Gaussian
noise. However, this assumption may still not hold under the true SCM, leading
to sub-optimal or inefficient solutions to the recourse problem. Next, we remove
any assumptions about the structural equations, and propose a second approach
that does not aim to approximate an individualized counterfactual distribution,
but instead considers the effect of interventions on a subpopulation defined by
certain shared characteristics with the given (factual) individual $x^F$. The key
idea behind this approach resembles the notion of conditional average treatment
effects (CATE) [1] (illustrated in Fig. 4) and is based on the fact that any
intervention $do(X_{\mathcal{I}} = \theta)$ only influences the descendants $d(\mathcal{I})$ of the intervened-upon
variables, while the non-descendants $nd(\mathcal{I})$ remain unaffected. Thus, when
evaluating an intervention, we can condition on $X_{nd(\mathcal{I})} = x^F_{nd(\mathcal{I})}$, thus selecting a
subpopulation of individuals similar to the factual subject.

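Specifically, we propose to solve the following subpopulation-based recourse
optimization problem:

$$\min_{a = do(X_{\mathcal{I}} = \theta) \in \mathcal{F}} \mathrm{cost}^F(a) \quad \text{subject to} \quad \mathbb{E}_{X_{d(\mathcal{I})} \mid do(X_{\mathcal{I}} = \theta),\, x^F_{nd(\mathcal{I})}}\big[h(x^F_{nd(\mathcal{I})}, \theta, X_{d(\mathcal{I})})\big] > \mathrm{thresh}(a), \qquad (15)$$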


where, in contrast to (13), the expectation is taken over the corresponding
interventional distribution.

In general, this interventional distribution does not match the conditional
distribution, i.e.,

$$P_{X_{d(\mathcal{I})} \mid do(X_{\mathcal{I}} = \theta),\, x^F_{nd(\mathcal{I})}} \neq P_{X_{d(\mathcal{I})} \mid X_{\mathcal{I}} = \theta,\, x^F_{nd(\mathcal{I})}},$$

because some spurious correlations in the observational distribution do not
transfer to the interventional setting. For example, in Fig. 2b we have that

$$P_{X_2 \mid do(X_1 = x_1, X_3 = x_3)} = P_{X_2 \mid X_1 = x_1} \neq P_{X_2 \mid X_1 = x_1, X_3 = x_3}.$$


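Fortunately, the interventional distribution can still be identified from the
observational one, as stated in the following proposition.

Proposition 6. Subject to causal sufficiency, $P_{X_{d(\mathcal{I})} \mid do(X_{\mathcal{I}} = \theta),\, x^F_{nd(\mathcal{I})}}$ is
observationally identifiable (i.e., computable from the observational distribution) via:

$$p\big(X_{d(\mathcal{I})} \mid do(X_{\mathcal{I}} = \theta), x^F_{nd(\mathcal{I})}\big) = \prod_{r \in d(\mathcal{I})} p\big(X_r \mid X_{\mathrm{pa}(r)}\big) \Big|_{X_{\mathcal{I}} = \theta,\; X_{nd(\mathcal{I})} = x^F_{nd(\mathcal{I})}} \qquad (16)$$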
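The factorisation in (16) suggests a simple sampling scheme: fix the intervened variables to $\theta$, fix the non-descendants to their factual values, and sample each descendant from its conditional in topological order. The sketch below illustrates this on a hypothetical three-node graph; the graph, the conditional samplers, and all names are assumptions standing in for learned conditionals.

```python
import random

# Hypothetical graph in topological order: X1 -> X2, X1 -> X3, X2 -> X3.
PARENTS = {"X1": [], "X2": ["X1"], "X3": ["X1", "X2"]}
TOPOLOGICAL_ORDER = ["X1", "X2", "X3"]

# Assumed stand-ins for the conditionals p(X_r | X_pa(r)).
CONDITIONAL_SAMPLERS = {
    "X2": lambda pa, rng: 0.5 * pa["X1"] + rng.gauss(0.0, 1.0),
    "X3": lambda pa, rng: pa["X1"] - pa["X2"] + rng.gauss(0.0, 0.5),
}

def descendants_of(targets):
    """Non-intervened nodes with an intervened ancestor."""
    found = set()
    for node in TOPOLOGICAL_ORDER:
        if node not in targets and any(p in targets or p in found for p in PARENTS[node]):
            found.add(node)
    return found

def sample_interventional(x_factual, intervention, rng):
    """Draw one sample following the factorisation in (16)."""
    affected = descendants_of(set(intervention))
    sample = {}
    for node in TOPOLOGICAL_ORDER:
        if node in intervention:
            sample[node] = intervention[node]        # X_I = theta
        elif node in affected:
            parents = {p: sample[p] for p in PARENTS[node]}
            sample[node] = CONDITIONAL_SAMPLERS[node](parents, rng)
        else:
            sample[node] = x_factual[node]           # non-descendants keep factual values
    return sample

rng = random.Random(0)
x_factual = {"X1": 1.0, "X2": 0.3, "X3": 0.1}
draws = [sample_interventional(x_factual, {"X2": 2.0}, rng) for _ in range(5000)]
mean_x3 = sum(d["X3"] for d in draws) / len(draws)
```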


As evident from Proposition 6, tackling the optimization problem in (15) in
the general case (i.e., for arbitrary graphs and intervention sets $\mathcal{I}$) requires
estimating the stable conditionals $P_{X_r \mid X_{\mathrm{pa}(r)}}$ (a.k.a. causal Markov kernels) in order
to compute the interventional expectation via (16). For convenience (see Sect. 4.3
for details), here we opt for latent-variable implicit density models, but other
conditional density estimation approaches may also be used [e.g., 7,10,68].
Specifically, we model each conditional $p(x_r \mid x_{\mathrm{pa}(r)})$ with a conditional variational
autoencoder (CVAE) [62] as:

$$p(x_r \mid x_{\mathrm{pa}(r)}) \approx p_{\psi_r}(x_r \mid x_{\mathrm{pa}(r)}) = \int p_{\psi_r}(x_r \mid x_{\mathrm{pa}(r)}, z_r)\, p(z_r)\, dz_r, \quad p(z_r) := \mathcal{N}(0, I). \qquad (17)$$

To facilitate sampling $x_r$ (and in analogy to the deterministic mechanisms
$f_r$ in SCMs), we opt for deterministic decoders in the form of neural nets
$D_r$ parametrised by $\psi_r$, i.e., $p_{\psi_r}(x_r \mid x_{\mathrm{pa}(r)}, z_r) = \delta(x_r - D_r(x_{\mathrm{pa}(r)}, z_r; \psi_r))$,
and rely on variational inference [77], amortised with approximate posteriors
$q_{\phi_r}(z_r \mid x_r, x_{\mathrm{pa}(r)})$ parametrised by encoders in the form of neural nets with
parameters $\phi_r$. We learn both the encoder and decoder parameters by
maximising the evidence lower bound (ELBO) using stochastic gradient descent
[11,30,31,50]. For further details, we refer to Appendix D of [27].


Remark 2. The collection of CVAEs can be interpreted as learning an
approximate SCM of the form

$$\mathcal{M}_{CVAE}: \quad S = \{X_r := D_r(X_{\mathrm{pa}(r)}, z_r; \psi_r)\}_{r=1}^d, \quad z_r \sim \mathcal{N}(0, I) \;\; \forall r \in [d]. \qquad (18)$$

However, this family of SCMs may not allow us to identify the true SCM
(provided it can be expressed as above) from data without additional assumptions.
Moreover, exact posterior inference over $z_r$ given $x^F$ is intractable, and we need
to resort to approximations instead. It is thus unclear whether sampling from
$q_{\phi_r}(z_r \mid x_r^F, x^F_{\mathrm{pa}(r)})$ instead of from $p(z_r)$ in (17) can be interpreted as a
counterfactual within (18). For further discussion on such “pseudo-counterfactuals” we
refer to Appendix C of [27].


4.3 Solving the Probabilistic Recourse Optimization Problem 


We now discuss how to solve the resulting optimization problems in (13) and
(15). First, note that the two problems differ only in the distribution over which the
expectation in the constraint is taken: in (13) this is the counterfactual
distribution of the descendants given in Proposition 5; and in (15) it is the interventional
distribution identified in Proposition 6. In either case, computing the expectation
for an arbitrary classifier h is intractable. Here, we approximate these integrals
via Monte Carlo by sampling $x^{(m)}_{d(\mathcal{I})}$ from the interventional or counterfactual
distributions resulting from $a = do(X_{\mathcal{I}} = \theta)$, i.e.,

$$\mathbb{E}_{X_{d(\mathcal{I})} \mid \theta}\big[h(x^F_{nd(\mathcal{I})}, \theta, X_{d(\mathcal{I})})\big] \approx \frac{1}{M} \sum_{m=1}^{M} h\big(x^F_{nd(\mathcal{I})}, \theta, x^{(m)}_{d(\mathcal{I})}\big).$$


Brute-Force Approach. A way to solve (13) and (15) is to (i) iterate over
$a \in \mathcal{F}$, with $\mathcal{F}$ being a finite set of feasible actions (possibly as a result of
discretizing in the case of a continuous search space); (ii) approximately evaluate
the constraint via Monte Carlo; and (iii) select a minimum cost action amongst
all evaluated candidates satisfying the constraint. However, this may be
computationally prohibitive and yield suboptimal interventions due to discretisation.
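Steps (i) to (iii) can be sketched as follows for a toy problem with one actionable variable and one descendant. The classifier, the sampler for the descendant, the absolute-difference cost, and the grid are all hypothetical choices made for illustration; the constraint uses the lower-confidence-bound threshold from (14).

```python
import math
import random
import statistics

def classifier(x1, x2):
    """Assumed logistic classifier h."""
    return 1.0 / (1.0 + math.exp(-(x1 + x2 - 2.0)))

def sample_x2_given_do_x1(theta, rng):
    """Hypothetical sampler for the descendant X2 under do(X1 = theta)."""
    return theta + rng.gauss(0.0, 0.3)

def brute_force_recourse(x1_factual, grid, gamma_lcb=1.0, n_mc=200, seed=0):
    rng = random.Random(seed)
    best = None
    for theta in grid:                                    # (i) iterate over actions
        outputs = [classifier(theta, sample_x2_given_do_x1(theta, rng))
                   for _ in range(n_mc)]                  # (ii) Monte Carlo estimate
        lcb = statistics.fmean(outputs) - gamma_lcb * statistics.pstdev(outputs)
        cost = abs(theta - x1_factual)
        if lcb > 0.5 and (best is None or cost < best[1]):
            best = (theta, cost, lcb)                     # (iii) cheapest valid action
    return best

grid = [i * 0.1 for i in range(31)]                       # discretised values in [0, 3]
best_action = brute_force_recourse(x1_factual=0.5, grid=grid)
print(best_action)
```

The resolution of the grid directly limits how close the returned action can get to the continuous optimum, which is the discretisation issue mentioned above.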


Gradient-based Approach. Recall that, for actions of the form $a = do(X_{\mathcal{I}} = \theta)$,
we need to optimize over both the intervention targets $\mathcal{I}$ and the
intervention values $\theta$. Selecting targets is a hard combinatorial optimization problem, as
there are $2^{d'}$ possible choices for $d' \leq d$ actionable features, with a potentially
infinite number of intervention values. We therefore consider different choices
of targets $\mathcal{I}$ in parallel, and propose a gradient-based approach suitable for
differentiable classifiers to efficiently find an optimal $\theta$ for a given intervention
set $\mathcal{I}$.⁹ In particular, we first rewrite the constrained optimization problem in
unconstrained form with Lagrangian [29,33]:

$$\mathcal{L}(\theta, \lambda) := \mathrm{cost}^F(a) + \lambda \Big(\mathrm{thresh}(a) - \mathbb{E}_{X_{d(\mathcal{I})} \mid \theta}\big[h(x^F_{nd(\mathcal{I})}, \theta, X_{d(\mathcal{I})})\big]\Big). \qquad (19)$$


9 For large d, when enumerating all $\mathcal{I}$ becomes computationally prohibitive, we can
upper-bound the allowed number of variables to be intervened on simultaneously
(e.g., $|\mathcal{I}| \leq 3$), or choose a greedy approach to select $\mathcal{I}$.

We then solve the saddle point problem $\min_\theta \max_\lambda \mathcal{L}(\theta, \lambda)$ arising from (19)
with stochastic gradient descent [11,30]. Since both the GP-SCM counterfactual
(12) and the CVAE interventional distributions (17) admit a reparametrization
trick [31,50], we can differentiate through the constraint:

$$\nabla_\theta \mathbb{E}_{X_{d(\mathcal{I})}}\big[h(x^F_{nd(\mathcal{I})}, \theta, X_{d(\mathcal{I})})\big] = \mathbb{E}_{z \sim \mathcal{N}(0, I)}\big[\nabla_\theta h(x^F_{nd(\mathcal{I})}, \theta, x_{d(\mathcal{I})}(z))\big]. \qquad (20)$$

Here, $x_{d(\mathcal{I})}(z)$ is obtained by iteratively computing all descendants in topological
order: either substituting z together with the other parents into the decoders $D_r$
for the CVAEs, or by using the Gaussian reparametrization $x_r(z) = \mu + \sigma z$
with $\mu$ and $\sigma$ given by (12) for the GP-SCM. A similar gradient estimator for
the variance which enters thresh(a) for $\gamma_{LCB} \neq 0$ is derived in Appendix F of [27].
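A minimal sketch of the saddle-point optimisation is given below for a toy problem with one intervened variable and one Gaussian descendant. It assumes $\gamma_{LCB} = 0$ (so that thresh(a) = 0.5 and no variance gradient is needed), a quadratic cost, a logistic classifier, and a descendant with $\mu = \theta$ and fixed $\sigma$; the gradient of h is written out by hand rather than obtained by automatic differentiation. All of these are illustrative assumptions.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def solve_recourse(x1_factual, thresh=0.5, sigma=0.3, lr=0.05,
                   n_steps=3000, n_mc=64, seed=0):
    """Stochastic gradient descent-ascent on the Lagrangian (19) for a toy SCM."""
    rng = random.Random(seed)
    theta, lam = x1_factual, 0.0
    for _ in range(n_steps):
        mean_h, mean_dh = 0.0, 0.0
        for _ in range(n_mc):
            z = rng.gauss(0.0, 1.0)
            x2 = theta + sigma * z                     # reparametrised descendant
            h = sigmoid(theta + x2 - 2.0)              # assumed classifier
            mean_h += h / n_mc
            mean_dh += 2.0 * h * (1.0 - h) / n_mc      # dh/dtheta, as in (20)
        grad_theta = 2.0 * (theta - x1_factual) - lam * mean_dh
        grad_lam = thresh - mean_h
        theta -= lr * grad_theta                       # descent step in theta
        lam = max(0.0, lam + lr * grad_lam)            # projected ascent step in lambda
    return theta, lam

theta_opt, lam_opt = solve_recourse(x1_factual=0.5)
print(theta_opt, lam_opt)
```

In this toy setting the constraint is active at the solution, so the procedure settles near the value of $\theta$ at which the expected classifier output equals the threshold.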


5 Experiments 


In our experiments, we compare different approaches for causal algorithmic 
recourse on synthetic and semi-synthetic data sets. Additional results can be 
found in Appendix B of [27]. 


5.1 Compared Methods 


We compare the naive point-based recourse approaches $\mathcal{M}_{LIN}$ and $\mathcal{M}_{KR}$
mentioned at the beginning of Sect. 4.1 as baselines with the proposed
counterfactual GP-SCM $\mathcal{M}_{GP}$ and the CVAE approach for sub-population-based recourse
($\mathrm{CATE}_{CVAE}$). For completeness, we also consider a $\mathrm{CATE}_{GP}$ approach, as a GP can
also be seen as modelling each conditional as a Gaussian,¹⁰ and also evaluate
the “pseudo-counterfactual” $\mathcal{M}_{CVAE}$ approach discussed in Remark 2. Finally,
we report oracle performance for individualised ($\mathcal{M}_\star$) and sub-population-based
($\mathrm{CATE}_\star$) recourse methods by sampling counterfactuals and interventions from
the true underlying SCM. We note that a comparison with non-causal recourse
approaches that assume independent features [58,69] or consider causal
relations to generate counterfactual explanations but not recourse actions [24,39] is
neither natural nor straightforward, because it is unclear whether descendant
variables should be allowed to change, whether keeping their value constant
should incur a cost, and, if so, how much, cf. [28].


5.2 Metrics 


We compare recourse actions recommended by the different methods in terms
of cost, computed as the L2-norm between the intervention $\theta_{\mathcal{I}}$ and the factual
value $x^F_{\mathcal{I}}$, normalised by the range of each feature $r \in \mathcal{I}$ observed in the
training data; and validity, computed as the percentage of individuals for which the
recommended actions result in a favourable prediction under the true (oracle)
SCM. For our probabilistic recourse methods, we also report the lower
confidence bound $\mathrm{LCB} := \mathbb{E}[h] - \gamma_{LCB} \sqrt{\mathrm{Var}[h]}$ of the selected action under the given
method.
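The three metrics can be computed along the following lines. This is a sketch with hypothetical helper names and toy numbers; in particular, treating a prediction above 0.5 as favourable and reporting cost unscaled (rather than as a percentage) are assumptions.

```python
import math
import statistics

def normalised_cost(theta, x_factual, feature_ranges):
    """L2-norm between intervention and factual values, each scaled by the feature range."""
    return math.sqrt(sum(((theta[r] - x_factual[r]) / feature_ranges[r]) ** 2
                         for r in theta))

def validity(oracle_predictions):
    """Percentage of individuals with a favourable prediction under the oracle SCM."""
    favourable = sum(p > 0.5 for p in oracle_predictions)
    return 100.0 * favourable / len(oracle_predictions)

def lower_confidence_bound(outputs, gamma_lcb):
    """LCB := E[h] - gamma_lcb * sqrt(Var[h]) from Monte Carlo classifier outputs."""
    return statistics.fmean(outputs) - gamma_lcb * statistics.pstdev(outputs)

# Toy values (assumed): an intervention on the loan amount L only.
cost = normalised_cost({"L": 3.0}, {"L": 1.0, "D": 5.0}, {"L": 4.0, "D": 10.0})
valid = validity([0.9, 0.4, 0.7, 0.6])
lcb = lower_confidence_bound([0.6, 0.8], gamma_lcb=2.0)
```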


10 Sampling from the noise prior instead of the posterior in (11) leads to an
interventional distribution in (12).

Table 1. Experimental results for the gradient-based approach on different 3-variable
SCMs. We show average performance ±1 standard deviation for N_runs = 100,
N_MC-samples = 100, and γ_LCB = 2.

Method    | LINEAR SCM                        | NON-LINEAR ANM                    | NON-ADDITIVE SCM
          | Valid⋆ (%)  LCB       Cost (%)    | Valid⋆ (%)  LCB       Cost (%)    | Valid⋆ (%)  LCB       Cost (%)
M⋆        | 100         -         10.9 ± 7.9  | 100         -         20.1 ± 12.3 | 100         -         13.2 ± 11.0
M_LIN     | 100         -         11.0 ± 7.0  | 54          -         20.6 ± 11.0 | 98          -         14.0 ± 13.5
M_KR      | 90          -         10.7 ± 6.5  | 91          -         20.6 ± 12.5 | 70          -         13.2 ± 11.6
M_GP      | 100         .55 ± .04 12.2 ± 8.3  | 100         .54 ± .03 21.9 ± 12.9 | 95          .52 ± .04 13.4 ± 12.8
M_CVAE    | 100         .55 ± .07 11.8 ± 7.7  | 97          .54 ± .05 22.6 ± 12.3 | 95          .51 ± .01 13.4 ± 12.2
CATE⋆     | 90          .56 ± .07 11.9 ± 9.2  | 97          .55 ± .05 26.3 ± 21.4 | 100         .52 ± .02 13.5 ± 13.0
CATE_GP   | 93          .56 ± .05 12.2 ± 8.4  | 94          .55 ± .06 25.0 ± 14.8 | 94          .52 ± .03 13.2 ± 13.1
CATE_CVAE | 89          .56 ± .08 12.1 ± 8.9  | 98          .54 ± .05 26.0 ± 14.3 | 100         .52 ± .05 13.6 ± 12.9


5.3 Synthetic 3-Variable SCMs Under Different Assumptions 


In our first set of experiments, we consider three classes of SCMs over three
variables with the same causal graph as in Fig. 2b. To test robustness of the
different methods to assumptions about the form of the true structural equations,
we consider a linear SCM, a non-linear ANM, and a more general, multi-modal
SCM with non-additive noise. For further details on the exact form we refer to
Appendix E of [27].

Results are shown in Table 1. We observe that the point-based recourse
approaches perform (relatively) well in terms of both validity and cost when
their underlying assumptions are met (i.e., $\mathcal{M}_{LIN}$ on the linear SCM and $\mathcal{M}_{KR}$
on the nonlinear ANM). Otherwise, validity drops significantly, as expected (see,
e.g., the results of $\mathcal{M}_{LIN}$ on the non-linear ANM, or of $\mathcal{M}_{KR}$ on the non-additive
SCM). Moreover, we note that the inferior performance of $\mathcal{M}_{KR}$ compared to
$\mathcal{M}_{LIN}$ on the linear SCM suggests an overfitting problem, which does not occur
for its more conservative probabilistic counterpart $\mathcal{M}_{GP}$. Generally, the
individualised approaches $\mathcal{M}_{GP}$ and $\mathcal{M}_{CVAE}$ perform very competitively in terms of cost
and validity, especially on the linear and nonlinear ANMs. The
subpopulation-based CATE approaches, on the other hand, perform particularly well on the
challenging non-additive SCM (on which the assumptions of GP approaches are
violated), where $\mathrm{CATE}_{CVAE}$ achieves perfect validity as the only non-oracle method.
As expected, the subpopulation-based approaches generally lead to higher cost
than the individualised ones, since the latter aim to achieve recourse only
for a given individual while the former do it for an entire group (see Fig. 4).


5.4 Semi-synthetic 7-Variable SCM for Loan-Approval 


We also test our methods on a larger semi-synthetic SCM inspired by the German
Credit UCI dataset [43]. We consider the variables age A, gender G,
education-level E, loan amount L, duration D, income I, and savings S with causal
graph shown in Fig. 5. We model age A, gender G and loan duration D as
non-actionable variables, but consider D to be mutable, i.e., it cannot be
manipulated directly but is allowed to change (e.g., as a consequence of an intervention
on L). The SCM includes linear and non-linear relationships, as well as
different types of variables and noise distributions, and is described in more detail in
Appendix B of [27].

Fig. 5. Assumed causal graph for the semi-synthetic loan approval dataset.

Table 2. Experimental results for the 7-variable SCM for loan-approval. We show
average performance ±1 standard deviation for N_runs = 100, N_MC-samples = 100,
and γ_LCB = 2.5. For linear and non-linear logistic regression as classifiers, we use the
gradient-based approach, whereas for the non-differentiable random forest classifier we
rely on the brute-force approach (with 10 discretised bins per dimension) to solve the
recourse optimisation problems.

Method    | LINEAR LOG. REGR.                 | NON-LIN. LOG. REGR. (MLP)         | RANDOM FOREST (BRUTE-FORCE)
          | Valid⋆ (%)  LCB       Cost (%)    | Valid⋆ (%)  LCB       Cost (%)    | Valid⋆ (%)  LCB       Cost (%)
M⋆        | 100         -         15.8 ± 7.6  | 100         -         11.0 ± 7.0  | 100         -         15.2 ± 7.5
M_LIN     | 19          -         15.4 ± 7.4  | 80          -         11.0 ± 6.9  | 94          -         15.6 ± 7.6
M_KR      | 41          -         15.6 ± 7.5  | 87          -         11.1 ± 7.0  | 92          -         15.1 ± 7.4
M_GP      | 100         .50 ± .00 18.0 ± 7.7  | 100         .52 ± .04 11.7 ± 7.3  | 100         .66 ± .14 16.3 ± 7.4
M_CVAE    | 100         .50 ± .00 16.6 ± 7.6  | 99          .51 ± .01 11.3 ± 6.9  | 100         .66 ± .14 15.9 ± 7.4
CATE⋆     | 93          .50 ± .01 22.0 ± 9.4  | 95          .52 ± .05 12.0 ± 7.7  | 98          .66 ± .15 17.0 ± 7.3
CATE_GP   | 93          .50 ± .02 21.7 ± 9.2  | 93          .51 ± .06 12.0 ± 7.4  | 100         .67 ± .15 17.1 ± 7.4
CATE_CVAE | 94          .49 ± .01 23.7 ± 11.3 | 95          .51 ± .03 12.0 ± 7.8  | 100         .68 ± .15 17.9 ± 7.4

The results are summarised in Table 2, where we observe that the insights 
discussed above similarly apply for data generated from a more complex SCM, 
and for different classifiers. 

Finally, we show the influence of $\gamma_{LCB}$ on the performance of the proposed
probabilistic approaches in Fig. 6. We observe that lower values of $\gamma_{LCB}$ lead to
lower validity (and cost), especially for the CATE approaches. As $\gamma_{LCB}$ increases,
validity approaches that of the corresponding oracles $\mathcal{M}_\star$ and $\mathrm{CATE}_\star$, outperforming
the point-based recourse approaches. In summary, our probabilistic recourse
approaches are not only more robust, but also allow controlling the trade-off
between validity and cost using $\gamma_{LCB}$.

Fig. 6. Trade-off between validity and cost which can be controlled via $\gamma_{LCB}$ for the
probabilistic recourse methods.


6 Discussion 


In this paper, we have focused on the problem of algorithmic recourse, i.e., the 
process by which an individual can change their situation to obtain a desired 
outcome from a machine learning model. Using the tools from causal reasoning 
(i.e., structural interventions and counterfactuals), we have shown that in their 
current form, counterfactual explanations only bring about agency for the indi- 
vidual to achieve recourse in unrealistic settings. In other words, counterfactual 
explanations imply recourse actions that may neither be optimal nor even result 
in favorably changing the prediction of h when acted upon. This shortcoming is 
primarily due to the lack of consideration of causal relations governing the world 
and thus, the failure to model the downstream effect of actions in the predic- 
tions of the machine learning model. In other words, although “counterfactual” 
is a term from causal language, we observed that existing approaches fall short 
in terms of taking causal reasoning into account when generating counterfac- 
tual explanations and the subsequent recourse actions. Thus, building on the 
statement by Wachter et al. [76] that counterfactual explanations “do not rely 
on knowledge of the causal structure of the world,” it is perhaps more appro- 
priate to refer to existing approaches as contrastive, rather than counterfactual, 
explanations [14,40]. See [26, §2] for more discussion. 

To directly take causal consequences of actions into account, we have
proposed a fundamental reformulation of the recourse problem, where actions are
performed as interventions and we seek to minimize the cost of performing
actions in a world governed by a set of (physical) laws captured in a
structural causal model. Our proposed formulation in (4), complemented with several
examples and a detailed discussion, allows for recourse through minimal
interventions (MINT), which, when performed, result in a structural counterfactual
that favourably changes the output of the model.

The primary limitation of the formulation in (4) is its reliance on the true
causal model of the world, subsuming both the graph and the structural
equations. In practice, the underlying causal model is rarely known, which suggests
that the counterfactual constraint in (4), i.e., $x^{SCF}(a) := x(a) \mid x^F = \mathbb{S}^a(\mathbb{S}^{-1}(x^F))$,
may not be (deterministically) identifiable. As a negative result, we
showed that algorithmic recourse cannot be guaranteed in the absence of perfect
knowledge about the underlying SCM governing the world, which unfortunately
is not available in practice. To address this limitation, we proposed two
probabilistic approaches to achieve recourse under more realistic assumptions. In
particular, we derived i) an individual-level recourse approach based on GPs
that approximates the counterfactual distribution by averaging over the
family of additive Gaussian SCMs; and ii) a subpopulation-based approach, which
assumes that only the causal graph is known and makes use of CVAEs to estimate
the conditional average treatment effect of an intervention on a subpopulation
of individuals similar to the one seeking recourse. Our experiments showed that
the proposed probabilistic approaches not only result in more robust recourse
interventions than approaches based on point estimates of the SCM, but also
allow trading off validity and cost.


Assumptions, Limitations, and Extensions. Throughout the present work, 
we have assumed a known causal graph and causal sufficiency. While this may not 
hold for all settings, it is the minimal necessary set of assumptions for causal rea- 
soning from observational data alone. Access to instrumental variables or exper- 
imental data may help further relax these assumptions [3, 13,66]. Moreover, if 
only a partial graph is available or some relations are known to be confounded, 
one will need to restrict recourse actions to the subset of interventions that are 
still identifiable [59,60,67]. An alternative approach could address causal suffi- 
ciency violations by relying on latent variable models to estimate confounders 
from multiple causes [78] or proxy variables [38], or to work with bounds on 
causal effects instead [5, 65, 74]. 

Perhaps more concerningly, our work highlights the implicit causal
assumptions made by existing approaches (i.e., that of independence, or feasible and
cost-free interventions), which may portray a false sense of recourse guarantees
where none exist (see Example 2 and all of Sect. 3.1). Our work aims
to highlight existing imperfect assumptions, and to offer an alternative formu- 
lation, backed with proofs and demonstrations, which would guarantee recourse 
if assumptions about the causal structure of the world were satisfied. Future 
research on causal algorithmic recourse may benefit from the rich literature in 
causality that has developed methods to verify and perform inference under 
various assumptions [45,48]. 

This is not to say that counterfactual explanations should be abandoned
altogether. On the contrary, we believe that counterfactual explanations hold
promise for “guided audit of the data” [76] and evaluating various desirable
model properties, such as robustness [21,58] or fairness [20,25,58,69,75]. Besides
this, it has been shown that designers of interpretable machine learning systems
use counterfactual explanations for predicting model behavior [34] or
uncovering inaccuracies in the data profile of individuals [70]. Complementing these
offerings of counterfactual explanations, we offer minimal interventions as a way
to guarantee algorithmic recourse in general settings, which is not implied by
counterfactual explanations.


On the Counterfactual vs Interventional Nature of Recourse. Given 
that we address two different notions of recourse—counterfactual/individualised 
(rung 3) vs. interventional/subpopulation-based (rung 2)—one may ask which 
framing is more appropriate. Since the main difference is whether the background 
variables U are assumed fixed (counterfactual) or not (interventional) when rea- 
soning about actions, we believe that this question is best addressed by thinking 
about the type of environment and interpretation of U: if the environment is 
static, or if U (mostly) captures unobserved information about the individual, 
the counterfactual notion seems to be the right one; if, on the other hand, U 
also captures environmental factors which may change, e.g., between consecutive 
loan applications, then the interventional notion of recourse may be more appro- 
priate. In practice, both notions may be present (for different variables), and the 
proposed approaches can be combined depending on the available domain knowl- 
edge since each parent-child causal relation is treated separately. We emphasise 
that the subpopulation-based approach is also practically motivated by a reluc- 
tance to make (parametric) assumptions about the structural equations which 
are untestable but necessary for counterfactual reasoning. It may therefore be 
useful to avoid problems of misspecification, even for counterfactual recourse, as 
demonstrated experimentally for the non-additive SCM. 


7 Conclusion 


In this work, we explored one of the main, but often overlooked, objectives of 
explanations as a means to allow people to act rather than just understand. 
Using counterexamples and the theory of structural causal models (SCM), we 
showed that actionable recommendations cannot, in general, be inferred from 
counterfactual explanations. We show that this shortcoming is due to the lack 
of consideration of causal relations governing the world and thus, the failure 
to model the downstream effect of actions in the predictions of the machine 
learning model. Instead, we proposed a shift of paradigm from recourse via 
nearest counterfactual explanations to recourse through minimal interventions 
(MINT), and presented a new optimization formulation for the common class 
of additive noise models. Our technical contributions were complemented with 
an extensive discussion on the form, feasibility, and scope of interventions in 
real-world settings. In follow-up work, we further investigated the
epistemological differences between counterfactual explanations and consequential
recommendations and argued that their technical treatment requires consideration at
different levels of the causal history [52] of events [26]. Whereas MINT
provided exact recourse under strong assumptions (requiring the true SCM), we
next explored how to offer recourse under milder and more realistic
assumptions (requiring only the causal graph). We presented two probabilistic approaches
that offer recourse with high probability. The first captures uncertainty over
structural equations under additive Gaussian noise, and uses Bayesian model
averaging to estimate the counterfactual distribution. The second removes any
assumptions on the structural equations by instead computing the average effect
of recourse actions on individuals similar to the person who seeks recourse,
leading to a novel subpopulation-based interventional notion of recourse. We then
derived a gradient-based procedure for selecting optimal recourse actions, and
showed empirically that the proposed approaches lead to more reliable
recommendations under imperfect causal knowledge than non-probabilistic baselines. This
contribution is important as it enables recourse recommendations to be
generated in more practical settings and under uncertain assumptions.

As a final note, while for simplicity we have focused in this chapter on credit
loan approvals, recourse has potential applications in other domains such as
healthcare [8,9,17,51], justice (e.g., pretrial bail) [4], and other settings (e.g.,
hiring) [12,44,57] in which actionable recommendations for individuals are sought.


Abstract. Significant progress has been made in image generation through
advances in Generative Adversarial Networks (GANs). However, it is still
not well understood how a realistic image is generated from a random
vector by the deep representations of GANs. This chapter
gives a summary of recent works on interpreting deep generative
models. The methods are categorized into the supervised, the unsupervised,
and the embedding-guided approaches. We will see how the
human-understandable concepts that emerge in the learned representation can
be identified and used for interactive image generation and editing.


Keywords: Explainable machine learning · Generative adversarial
networks · Image generation


1 Introduction 


Over the years, great progress has been made in image generation through
advances in Generative Adversarial Networks (GANs) [6,12]. As shown in Fig. 1,
the generation quality and diversity have improved substantially from the
early DCGAN [16] to the very recent Alias-free GAN [11]. After the adversarial
training of the generator and the discriminator, we can use the generator as
a pretrained feedforward network for image generation. Given a vector
sampled from some random distribution, this generator can synthesize a realistic
image as the output. However, such an image generation pipeline does not allow
users to customize the output image, such as changing the lighting condition of
the output bedroom image or adding a smile to the output face image. Moreover,
it is less understood how a realistic image can be generated from the layer-wise
representations of the generator. Therefore, we need to interpret the learned
representation of deep generative models, both for understanding and for the
practical application of interactive image editing.

This chapter will introduce the recent progress of explainable machine
learning for deep generative models. I will show how we can identify the
human-understandable concepts in the generative representation and use them to steer
the generator for interactive image generation. Readers might also be interested
in watching a relevant tutorial talk I gave at the CVPR’21 Tutorial on Interpretable
Machine Learning for Computer Vision.¹ A more detailed survey paper on GAN
interpretation and inversion can be found in [21].

Fig. 1. Progress of image generation made by different GAN models over the years.

This chapter focuses on interpreting the pretrained GAN models, but a simi- 
lar methodology can be extended to other generative models such as VAE. Recent 
interpretation methods can be summarized into the following three approaches: 
the supervised approach, the unsupervised approach, and the embedding-guided 
approach. The supervised approach uses labels or classifiers to align the mean- 
ingful visual concept with the deep generative representation; the unsupervised 
approach aims to identify the steerable latent factors in the deep generative 
representation through solving an optimization problem; the embedding-guided 
approach uses the recent pretrained language-image embedding CLIP [15] to 
allow a text description to guide the image generation process. 

In the following sections, I will select representative methods from each app- 
roach and briefly introduce them as primers for this rapidly growing direction. 


2 Supervised Approach 


Fig. 2. GAN dissection framework and interactive image editing interface. Images are
extracted from [3]. The method aligns the unit activation with the semantic mask of
the output image; thus, by turning the unit activation up or down, we can include or
remove the corresponding visual concept in the output image.


1 https://youtu.be/PtRU2B6Iml4.

The supervised approach uses labels or trained classifiers to probe the
representation of the generator. One of the earliest interpretation methods is GAN
Dissection [4]. Derived from the previous work Network Dissection [3], GAN
Dissection aims to visualize and understand the individual convolutional filters (we
term them units) in the pretrained generator. It uses semantic segmentation
networks [24] to segment the output images. It then calculates the agreement
between the spatial location of the unit activation map and the semantic mask
of the output image. This method can identify a group of interpretable units
closely related to object concepts, such as sofa, table, grass, and buildings. Those
units are then used as switches with which we can add or remove objects such
as a tree or lamp by turning up or down the activation of the corresponding units.
The framework of GAN Dissection and the image editing interface are shown in
Fig. 2. In the interface of GAN Dissection, the user can select the object to be
manipulated and brush the output image where it should be removed or added.
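The agreement between a unit's activation map and a semantic mask can be scored, for instance, with intersection-over-union (IoU) after thresholding the activation. The sketch below illustrates this on tiny hand-written maps; the threshold, the map sizes, and the function name are assumptions, and the activation is taken to be already upsampled to the mask resolution.

```python
def unit_concept_iou(activation, mask, threshold):
    """IoU between a thresholded unit activation map and a binary concept mask."""
    intersection = 0
    union = 0
    for act_row, mask_row in zip(activation, mask):
        for act, in_mask in zip(act_row, mask_row):
            active = act > threshold
            present = bool(in_mask)
            intersection += int(active and present)
            union += int(active or present)
    return intersection / union if union else 0.0

# Toy 2x2 example (assumed values): the unit fires on the left column,
# while the concept mask covers only the top-left pixel.
activation = [[0.9, 0.1],
              [0.8, 0.2]]
mask = [[1, 0],
        [0, 0]]
iou = unit_concept_iou(activation, mask, threshold=0.5)
```

Units whose score is high for some concept are the candidates for the "switches" described above.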

Besides steering the filters at the intermediate convolutional layers of the
generator as GAN Dissection does, the latent space from which we sample the
latent vector as input to the generator is also being explored. The underlying
interpretable subspaces aligning with certain attributes of the output image can
be identified. Here we denote the pretrained generator as $G(\cdot)$ and the
random vector sampled from the latent space as $z$, so the output image becomes
$I = G(z)$. Different vectors yield different output images; thus the latent
space encodes various attributes of images. If we can steer the vector $z$
through one relevant subspace while preserving its projection onto the other
subspaces, we can edit one attribute of the output image in a disentangled way.


[Figure 3 panels: identifying interpretable boundaries in latent space; changing indoor lighting; adding clouds.]


Fig. 3. We can use classifiers to predict various attributes from the output
image and then go back to the latent space to identify the attribute boundaries.
The images below show the image editing results achieved by [22].

To align the latent space with the semantic space, we can first apply
off-the-shelf classifiers to extract the attributes of the synthesized images
and then compute the causality between the attributes occurring in the generated
images and the corresponding vectors in the latent space. The HiGAN method
proposed in [22] follows such a supervised approach, as illustrated in Fig. 3:
(1) Thousands of latent vectors are sampled, and the images are generated.
(2) Various levels of attributes are predicted from the generated images by
applying the off-the-shelf classifiers. (3) For each attribute $a$, a linear
boundary $n_a$ is trained in the latent space using the predicted labels and
the latent vectors. We treat this as a binary classification problem and train
a linear SVM to recognize each attribute; the weight vector of the trained SVM
is $n_a$. (4) A counterfactual verification step is taken to pick out the
reliable boundaries. Here we follow a linear model to shift the latent code as


$$I' = G(z + \lambda n_a), \tag{1}$$


where $n_a$ denotes the normal vector of the trained attribute boundary,
$\lambda$ is the manipulation step, and $I'$ is the edited image compared to the
original image $I$. Then the difference between the predicted attribute scores
before and after manipulation becomes


$$\Delta_a = \frac{1}{K} \sum_{k=1}^{K} \max\big(F(G(z_k + \lambda n_a)) - F(G(z_k)),\, 0\big), \tag{2}$$


where $F(\cdot)$ is the attribute predictor that takes an image as input, and
$K$ is the number of synthesized images. Ranking $\Delta_a$ allows us to
identify the reliable attribute boundaries out of the candidate set $\{n_a\}$,
which contains about one hundred attribute boundaries trained in step 3 of the
HiGAN method. After that, we can edit the output image from the generator by
adding or removing the normal vector of the target attribute on the original
latent code. Some image manipulation results are shown in Fig. 3.
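
The counterfactual verification score of Eq. (2) can be sketched as follows.
The generator `G`, the attribute predictor `F`, and the candidate directions
are toy stand-ins assumed purely for illustration; a real setup would use a
pretrained generator, an off-the-shelf classifier, and SVM normals.

```python
import numpy as np

def counterfactual_score(G, F, latents, normal, lam=1.0):
    """Mean positive change of the attribute score after moving z along `normal`."""
    diffs = [max(F(G(z + lam * normal)) - F(G(z)), 0.0) for z in latents]
    return float(np.mean(diffs))

# Hypothetical stand-ins: a linear "generator" and a predictor that reads the
# first coordinate of the generated output.
W = np.eye(3)
G = lambda z: W @ z
F = lambda image: float(image[0])

rng = np.random.default_rng(0)
latents = rng.normal(size=(100, 3))          # K = 100 sampled latent codes
relevant = np.array([1.0, 0.0, 0.0])         # direction that changes the attribute
irrelevant = np.array([0.0, 1.0, 0.0])       # direction that leaves it unchanged

score_relevant = counterfactual_score(G, F, latents, relevant, lam=2.0)
score_irrelevant = counterfactual_score(G, F, latents, irrelevant, lam=2.0)
```

Ranking the candidate directions by this score keeps `relevant` and discards
`irrelevant`, which mirrors how reliable boundaries are selected.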

Similar supervised methods have been developed to edit the facial 
attributes [17,18] and improve the image memorability [5]. Steerability of vari- 
ous attributes in GANs has also been analyzed [9]. Besides, the work of Style- 
Flow [1] replaces the linear model with a nonlinear invertible flow-based model 
in the latent space with more precise facial editing. Some recent work uses a 
differentiable renderer to extract 3D information from the image GANs for more 
controllable view synthesis [23]. For the supervised approach, many challenges 
remain for future work, such as expanding the annotation dictionary, achieving 
more disentangled manipulation, and aligning latent space with image region. 


3 Unsupervised Approach 


As generative models become more and more popular, people start training them 
on a wide range of images, such as cats and anime. To steer the generative models 
trained for cat or anime generation, following the previous supervised approach, 
we have to define the attributes of the images and annotate many images to 
train the classifiers. It is a very time-consuming process. 
Alternatively, the unsupervised approach aims to identify the controllable
dimensions of the generator without using labels or classifiers.

SeFa [19] is an unsupervised approach for discovering the interpretable
representation of a generator. It directly decomposes the pre-trained weights.
More specifically, in the pre-trained generator of the popular StyleGAN [12] or
PGGAN [10] model, there is an affine transformation between the latent code
and the internal activation. Thus the manipulation model can be simplified as


$$y' \triangleq G_1(z') = G_1(z + \alpha n) = Az + b + \alpha A n = y + \alpha A n, \tag{3}$$


where $y$ is the original projected code and $y'$ is the projected code after
manipulation by $n$. From Eq. (3) we can see that the manipulation process is
instance independent. In other words, given any latent code $z$ together with a
particular latent direction $n$, the editing can always be achieved by adding
the term $\alpha A n$ onto the projected code after the first step. From this
perspective, the weight parameter $A$ should contain the essential knowledge of
the image variation. Thus we aim to discover important latent directions by
decomposing $A$ in an unsupervised manner. We propose to solve the following
optimization problem:


$$N^* = \operatorname*{arg\,max}_{\{N \in \mathbb{R}^{d \times k} :\; n_i^T n_i = 1 \;\forall i = 1, \dots, k\}} \; \sum_{i=1}^{k} \|A n_i\|_2^2, \tag{4}$$


where $N = [n_1, n_2, \dots, n_k]$ corresponds to the top $k$ semantics sorted by
their eigenvalues, $d$ is the dimension of the latent space, and $A$ is the
learned weight in the affine transform between the latent code and the internal
activation. This objective aims at finding the directions that cause large
variations after the projection by $A$. The resulting solution consists of the
eigenvectors of the matrix $A^T A$. The resulting directions at different layers
control different attributes of the output image; thus pushing the latent code
$z$ along the important directions $\{n_1, n_2, \dots, n_k\}$ facilitates
interactive image editing. Figure 4 shows some editing results.
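
A minimal numerical sketch of this closed-form factorization is given below. The
weight matrix `A` is random here, which is an assumption for illustration; in
practice it would be read from the first affine layer of a pretrained generator.

```python
import numpy as np

def sefa_directions(A, k):
    """Top-k unit-norm directions maximizing ||A n||^2: eigenvectors of A^T A."""
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    return eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
A = rng.normal(size=(16, 8))   # hypothetical weight: latent dim 8 -> activation dim 16
b = rng.normal(size=16)
N, eigvals = sefa_directions(A, k=3)

# Editing as in Eq. (3): moving z along n shifts the projected code by alpha * A n,
# independently of the particular z.
z, alpha = rng.normal(size=8), 2.0
y = A @ z + b
y_edit = A @ (z + alpha * N[:, 0]) + b
```

Because `A.T @ A` is symmetric, `np.linalg.eigh` returns orthonormal
eigenvectors, so the unit-norm constraint in Eq. (4) holds by construction.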

Fig. 4. Manipulation results from SeFa [19] on the left and the interface for
interactive image editing on the right. On the left, each attribute corresponds
to some $n_i$ in the latent space of the generator. In the interface, the user
can simply drag the slider bar associated with a certain attribute to edit the
output image.

Many other methods have been developed for the unsupervised discovery of
interpretable latent representations. Härkönen et al. [7] perform PCA on the
sampled data to find primary directions in the latent space. Voynov and
Babenko [20] jointly learn a candidate matrix and a classifier such that the
classifier can properly recognize the semantic directions in the matrix.
Peebles et al. [14] develop a Hessian penalty as a regularizer for improving
disentanglement in training. He et al. [8] design a linear subspace with an
orthogonal basis in each layer of the generator to encourage the decomposition
of attributes. Many challenges remain for the unsupervised approach, such as
how to evaluate the results of unsupervised learning, annotate each discovered
dimension, and improve the disentanglement in the GAN training process.


4 Embedding-Guided Approach 


The embedding-guided approach aligns language embeddings with generative
representations. It allows users to use any free-form text to guide the image
generation. The difference from the previous unsupervised approach is that the
embedding-guided approach conditions the manipulation on the given text, which
makes it more flexible, whereas the unsupervised approach discovers the
steerable dimensions in a bottom-up way and thus lacks fine-grained control.

Recent work on StyleCLIP [13] combines the pretrained language-image
embedding CLIP [15] and the StyleGAN generator [12] for free-form text-driven
image editing. CLIP is an embedding model pretrained on 400 million image-text
pairs. Given a source image $I_s$, StyleCLIP first projects it back into the
latent space as $w_s$ using an existing GAN inversion method. It then designs
the following optimization objective:


$$w^* = \operatorname*{arg\,min}_{w} \; D_{\mathrm{CLIP}}(G(w), t) + \lambda_{L2} \|w - w_s\|_2 + \lambda_{ID} L_{ID}(w, w_s), \tag{5}$$


where $D_{\mathrm{CLIP}}(\cdot, \cdot)$ measures the distance between an image
and a text $t$ using the pre-trained CLIP model, and the second and third terms
are regularizers that keep the similarity and identity with the original input
image. This optimization objective thus results in a latent code $w^*$ that
generates an image close to the given text in the CLIP embedding space and
similar to the original input image. StyleCLIP further develops architecture
designs to speed up the iterative optimization. Figure 5 shows the text-driven
image editing results. Concurrent work called Paint by Word from Bau et al. [2]
combines the CLIP embedding with region-based image editing. It has a masked
optimization objective that allows the user to brush the image to provide the
input mask.
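
To make the three terms of Eq. (5) concrete, the sketch below evaluates the
objective for two candidate latent codes. Everything here is a toy stand-in
assumed for illustration: `G` is the identity, and a cosine distance replaces
both the CLIP distance and the identity loss. It shows only how the terms trade
off, not how StyleCLIP is implemented.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity; used here as a stand-in distance."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def edit_objective(w, w_s, t, G, d_clip, l_id, lam_l2=0.1, lam_id=0.1):
    """Value of the Eq. (5) objective for a candidate latent code w."""
    return (d_clip(G(w), t)
            + lam_l2 * float(np.linalg.norm(w - w_s))
            + lam_id * l_id(w, w_s))

G = lambda w: w                          # hypothetical generator
w_s = np.array([1.0, 0.0])               # inverted source latent code
t = np.array([0.0, 1.0])                 # hypothetical text embedding
w_candidate = np.array([1.0, 1.0])       # code moved towards the text

loss_source = edit_objective(w_s, w_s, t, G, cosine_distance, cosine_distance)
loss_candidate = edit_objective(w_candidate, w_s, t, G, cosine_distance, cosine_distance)
```

The candidate lowers the text distance by more than it pays in the two
regularizers, so its objective value is smaller than that of the source code.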

[Figure 5 panels: a) "Mohawk hairstyle"; b) "Rustic" (interface labels: "At This Location...", "Paint This Word").]


Fig. 5. Text-driven image editing results from a) StyleCLIP [13] and b) Paint by
Word [2].


5 Concluding Remarks 


Interpreting deep generative models leads to a deeper understanding of how 
the learned representations decompose images to generate them. Discovering 
the human-understandable concepts and steerable dimensions in the deep gen- 
erative representations also facilitates the promising applications of interactive 
image generation and editing. We have introduced representative methods from 
three approaches: the supervised approach, the unsupervised approach, and 
the embedding-guided approach. The supervised approach can achieve the best 
image editing quality when the labels or classifiers are available. It remains
challenging for the unsupervised and embedding-guided approaches to achieve
disentangled manipulation. More future work is expected on the accurate
inversion of real images and on precise local and global image editing.

Abstract. In reinforcement learning, an agent interacts with an environment
from which it receives rewards that are then used to learn a
task. However, it is often unclear what strategies or concepts the agent 
has learned to solve the task. Thus, interpretability of the agent’s behav- 
ior is an important aspect in practical applications, next to the agent’s 
performance at the task itself. However, with the increasing complexity 
of both tasks and agents, interpreting the agent’s behavior becomes much 
more difficult. Therefore, developing new interpretable RL agents is of 
high importance. To this end, we propose to use Align-RUDDER as an 
interpretability method for reinforcement learning. Align-RUDDER is a 
method based on the recently introduced RUDDER framework, which 
relies on contribution analysis of an LSTM model, to redistribute rewards 
to key events. From these key events a strategy can be derived, guiding 
the agent’s decisions in order to solve a certain task. More importantly, 
the key events are in general interpretable by humans, and are often 
sub-tasks; where solving these sub-tasks is crucial for solving the main 
task. Align-RUDDER enhances the RUDDER framework with methods 
from multiple sequence alignment (MSA) to identify key events from 
demonstration trajectories. MSA needs only a few trajectories in order 
to perform well, and is much better understood than deep learning mod- 
els such as LSTMs. Consequently, strategies and concepts can be learned 
from a few expert demonstrations, where the expert can be a human or 
an agent trained by reinforcement learning. By substituting RUDDER’s 
LSTM with a profile model that is obtained from MSA of demonstra- 
tion trajectories, we are able to interpret an agent at three stages: First, 
by extracting common strategies from demonstration trajectories with 
MSA. Second, by encoding the most prevalent strategy via the MSA 
profile model and therefore explaining the expert’s behavior. And third, 
by allowing the interpretation of an arbitrary agent’s behavior based on
its demonstration trajectories. 


Keywords: Explainable AI · Contribution analysis · Reinforcement
learning · Credit assignment · Reward redistribution


1 Introduction 


With recent advances in computing power together with increased availability 
of large datasets, machine learning has emerged as a key technology for modern 
software systems. Especially in the fields of computer vision [34,52] and natural 
language processing [14,71] vast improvements have been made using machine 
learning. 

In contrast to computer vision and natural language processing, which are 
both based on supervised learning, reinforcement learning is more general as it 
constructs agents for planning and decision-making. Recent advances in rein- 
forcement learning have resulted in impressive models that are capable of sur- 
passing humans in games [39,58,73]. However, reinforcement learning is still 
waiting for its breakthrough in real world applications, not least because of two 
issues. First, the amount of human effort and computational resources required 
to develop and train reinforcement learning systems is prohibitively expensive for 
widespread adoption. Second, machine learning, and reinforcement learning in
particular, produces black box models, which allow neither explaining model
outcomes nor building trust in these models. This insufficient explainability
limits the application of reinforcement learning agents; as a result,
reinforcement learning is often confined to computer games and simulations.

Advances in the field of explainable AI (XAI) have introduced methods and 
techniques to alleviate the problem of insufficient explainability for supervised 
machine learning [3-5,41,42,64]. However, these XAI methods cannot explain 
the behavior of the more complex reinforcement learning agents. Among other 
problems, delayed and sparse rewards or hand-crafted reward functions make it 
hard to explain an agent’s final behavior. Therefore, interpreting and explaining 
agents trained with reinforcement learning is an integral component for viably 
moving towards real-world reinforcement learning applications. 

We explore the current state of explainability methods and their applica- 
bility in the field of reinforcement learning and introduce a method, Align- 
RUDDER [45], which is intrinsically explainable by exposing the global strategy 
of the trained agent. The paper is structured as follows: In Sect. 2.1 we review 
explainability methods and how they can be categorized. Sect. 2.2 defines the set- 
ting of reinforcement learning. In Sect. 2.3, Sect. 2.4 and Sect. 2.5 we explore the 
problem of credit assignment and potential solutions from the field of explainable 
AI. In Sect. 2.6 we review the concept of reward redistribution as a solution for 
credit assignment. Section 3 introduces the concept of strategy extraction and
explores its potential for training reinforcement learning agents (Sect. 3.1) as
well as its intrinsic explainability (Sect. 3.2). Sect. 4 discusses its usage
for explaining arbitrary agent behaviors. Finally, in Sect. 5 we explore
limitations of this approach before concluding in Sect. 6.


2 Background 


2.1 Explainability Methods 


The importance of explainability methods to provide insights into black box 
machine learning methods such as deep neural networks has significantly 
increased in recent years [72]. These methods can be categorized based on mul- 
tiple factors [15]. 

First, we can distinguish local and global methods, where global methods
explain the general model behavior, while local methods focus on explaining
specific decisions (e.g. explain the classification of a specific sample) or the
influence of individual features on the model output [1,15]. Second, we
distinguish between
intrinsically explainable models and post-hoc methods [15]. Intrinsically explain- 
able models are designed to provide explanations as well as model predictions. 
Examples for such models are decision trees [35], rule-based models [75], linear 
models [22] or attention models [13]. Post-hoc methods are applied to existing 
models and often require a second model to provide explanations (e.g. approxi- 
mate an existing model with a linear model that can be interpreted) or provide 
limited explanations (e.g. determine important input features but no detailed 
explanations of the inner workings of a model). While intrinsically explainable 
models offer more detailed explanations and insights, they often sacrifice predic- 
tive performance. Post-hoc methods, in contrast, have little to no influence on 
predictive performance but lack detailed explanations of the model. 

Post-hoc explainability methods often provide insights in the form of attri- 
butions, i.e. a measure of how important certain features are with regard to the 
model’s output. In Fig. 1 we illustrate the model attribution from input towards 
its prediction. We further categorize attribution methods into sensitivity analysis 
and contribution analysis. 


[Figure 1 labels: prediction "cow"; model attribution.]


Fig. 1. Illustration of model input attributions towards its prediction [46]. 

Sensitivity analysis methods, or “backpropagation through a model”
[8,43,50,51], provide attributions by calculating the gradient of the model with
respect to its input. The magnitude of these gradients is then used to assign a
measure of importance to individual features of the input. While sensitivity
analysis is typically simple to implement, these methods have several problems
such as susceptibility to local minima, instabilities, exploding or vanishing
gradients and proper exploration [28,54]. The major drawback, however, is that
the relevance of features can be missed, since sensitivity analysis does not
consider their contribution to the output but only how small perturbations of
features change the output. Therefore, important features can receive low
attribution scores: small changes to them would not significantly change the
model’s output, although removing them would change the output completely. A
prominent example of sensitivity analysis methods is saliency maps [59].

Contribution analysis methods provide attributions based on the contribution 
of individual features to the model output, and therefore do not suffer from the 
drawbacks of sensitivity analysis methods. This can be achieved in a variety of
ways; prominent examples are integrated gradients [64] and layer-wise relevance
propagation (ε-LRP) [11].

To illustrate the differences between sensitivity analysis and contribution
analysis, we can consider a model $y = f(x)$ that takes an $n$-dimensional input
vector $x = \{x_1, \dots, x_n\} \in \mathbb{R}^n$ and predicts a $k$-dimensional
output vector $y = \{y_1, \dots, y_k\} \in \mathbb{R}^k$. We then define an
$n$-dimensional attribution vector $R^k = \{R^k_1, \dots, R^k_n\} \in
\mathbb{R}^n$ for the $k$-th output unit, which provides the relevance of each
input value towards its final prediction. The attribution is obtained through
the model gradient:


$$R^k_i(x) = \frac{\partial f_k(x)}{\partial x_i}, \tag{1}$$


although this is not the only option for attribution through gradients.
Alternatively, the attribution can be defined by multiplying the input vector
with the model gradient [6]:


$$R^k_i(x) = x_i \, \frac{\partial f_k(x)}{\partial x_i}. \tag{2}$$


Considering Eq. 1 we answer the question of “What do we need to change in
$x$ to get a certain outcome $y_k$?”, while considering Eq. 2 we answer the
question of “How much did $x_i$ contribute to the outcome $y_k$?” [1].
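
The difference between the two attributions can be seen on a toy model. The
sketch below assumes a linear model with a single output unit and estimates
the gradient by central differences; the weights and the input are made up for
illustration.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of df/dx_i for a scalar-valued f."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

w = np.array([0.1, 2.0, -1.0])
f = lambda x: float(w @ x)              # hypothetical model, one output unit

x = np.array([10.0, 0.0, 1.0])
sensitivity = numerical_gradient(f, x)  # Eq. (1): gradient only
contribution = x * sensitivity          # Eq. (2): gradient times input
```

Sensitivity ranks the second feature highest because of its large weight,
although that feature is zero in this input and contributes nothing to the
output. For a linear model, the contributions of Eq. (2) sum to the model
output.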

In reinforcement learning, we are interested in assessing the contributions of 
actions along a sequence which were relevant for achieving a particular return. 
Therefore, we are interested in contribution analysis methods rather than sensi- 
tivity analysis. We point out that this is closely related to the credit assignment 
problem, which we will further elaborate in the following sections. 


2.2 Reinforcement Learning


In reinforcement learning, an agent is trained to take a sequence of actions by 
interacting with an environment and by learning from the feedback provided 
by the environment. The agent selects actions based on its policy, which are 
executed in the environment. The environment then transitions into its next 
state based on state-transition probabilities, and the agent receives feedback in 
the form of the next state and a reward signal. The objective of reinforcement 
learning is to learn a policy that maximizes the expected cumulative reward, 
also called return. 

More formally, we define our problem setting as a finite Markov decision
process (MDP) $\mathcal{P}$, a 5-tuple $\mathcal{P} = (\mathcal{S},
\mathcal{A}, \mathcal{R}, p, \gamma)$ of finite sets $\mathcal{S}$ with states
$s$ (random variable $S_t$ at time $t$), $\mathcal{A}$ with actions $a$ (random
variable $A_t$), and $\mathcal{R}$ with rewards $r$ (random variable
$R_{t+1}$) [47]. Furthermore, $\mathcal{P}$ has transition-reward distributions
$p(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$ conditioned on
state-actions, a policy given as action distributions $\pi(A_{t+1} = a' \mid
S_{t+1} = s')$ conditioned on states, and a discount factor $\gamma \in [0, 1]$.
The return $G_t$ is $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. We often
consider finite horizon MDPs with sequence length $T$ and $\gamma = 1$, giving
$G_t = \sum_{k=0}^{T-t} R_{t+k+1}$. The state-value function $V^\pi(s)$ for a
policy $\pi$ is

$$V^\pi(s) = \mathrm{E}_\pi [G_t \mid S_t = s]$$


and its respective action-value function $Q^\pi(s, a)$ is


$$Q^\pi(s, a) = \mathrm{E}_\pi [G_t \mid S_t = s, A_t = a].$$


The goal of reinforcement learning is to maximize the expected return at time
$t = 0$, that is $v_0^\pi = \mathrm{E}_\pi [G_0]$. The optimal policy $\pi^*$ is
$\pi^* = \operatorname{argmax}_\pi [v_0^\pi]$. We consider the difficult task of
learning a policy when the reward given by the environment is sparse or
delayed. An integral part of facilitating learning in this challenging setting
is credit assignment, i.e. determining the contribution of states and actions
towards the return.
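
As a small illustration of the return, the sketch below computes $G_t$ for every
time step of one finite episode. The reward sequence is made up and mimics a
delayed reward that arrives only at the end of the episode.

```python
def returns(rewards, gamma=1.0):
    """G_t for every t, where rewards[t] holds R_{t+1}; one backward pass."""
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

delayed = [0.0, 0.0, 0.0, 1.0]           # hypothetical sparse, delayed reward
undiscounted = returns(delayed)          # gamma = 1
discounted = returns(delayed, gamma=0.5)
```

With $\gamma = 1$ every step has the same return, so the return alone does not
reveal which state or action was responsible for the final reward. This is the
credit assignment problem.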


2.3 Credit Assignment in Reinforcement Learning 


In reinforcement learning, we face two fundamental problems. First, the trade- 
off between exploring actions that lead to promising new states and exploiting 
actions that maximize the return. Second, the credit assignment problem, which 
involves correctly attributing credit to actions in a sequence that led to a certain 
return or outcome [2,66,67]. Credit assignment becomes more difficult as the 
delay between selected actions and their associated rewards increases [2,45]. 
The study of credit assignment in sequences is a long-standing challenge and 
has been around since the start of artificial intelligence research [38]. Chess is 
an example of a sparse and delayed reward problem, where the reward is given 
at the end of the game. Assigning credit to the large number of decisions taken 
in a game of chess is quite difficult when the feedback is received only at the 
end of the game (i.e. win, lose or draw). It is difficult for the learning system to 
identify which actions were more or less important for the resulting outcome. As
a result, the notion of winning or losing alone is often not informative enough 
for learning systems [38]. This motivates the need to improve credit assignment 
methods, especially for problems with sparse and delayed rewards. We further 
elaborate on various credit assignment methods in the next section. 


2.4 Methods for Credit Assignment 


Credit assignment in reinforcement learning can be classified into two different 
classes: 1) Structural credit assignment, and 2) Temporal credit assignment [66]. 
Structural credit assignment is related to the internals of the learning system that 
lead to choosing a particular action. Backpropagation [27] is quite popular for 
such structural credit assignment in Deep Reinforcement Learning. In contrast, 
temporal credit assignment is related to the events (states and/or actions) which 
led to a particular outcome in a sequence. In this work, we examine temporal 
credit assignment methods in detail. 

Temporal credit assignment methods are used to obtain policies which maxi- 
mize future rewards. Temporal difference (TD) learning [67] is a temporal credit 
assignment method which has close ties to dynamic programming and the Bell- 
man operator [10]. It combines policy evaluation and improvement in a single 
step, by using the maximum action-value estimate at the next state to improve 
the action-value estimate at the current state. However, TD learning suffers from
high bias and learns slowly when the rewards are sparse and delayed.
Eligibility traces and TD(λ) [60] were introduced to improve the performance
of TD. Instead of looking one step into the future, information from n steps in
the future or past is used to update the current estimate of the action-value
function. However, the performance of the algorithm depends strongly on how
far into the future or the past it looks. In TD learning, one tries
to find the action-value which maximizes the future return. In contrast, there
exist direct policy optimization methods like policy gradient [65] and related
methods like actor-critic [40,56].
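
The slow propagation of delayed rewards under one-step TD learning can be seen
in a tabular sketch. The chain environment, the learning rate, and the helper
names are assumptions made for illustration.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=1.0):
    """One TD update using the maximum action-value estimate at the next state."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Hypothetical chain of 5 states with a single action; only the final
# transition is rewarded. Entries are (state, action, reward, next_state, done).
n_states = 5
episode = [(s, 0, 0.0, s + 1, False) for s in range(n_states - 1)]
episode.append((n_states - 1, 0, 1.0, n_states - 1, True))

def run(n_episodes):
    Q = np.zeros((n_states, 1))
    for _ in range(n_episodes):
        for s, a, r, s_next, done in episode:
            q_learning_update(Q, s, a, r, s_next, done)
    return Q

Q_after_1 = run(1)
Q_after_2 = run(2)
```

After one episode only the last state has a non-zero estimate, and each further
episode moves the information back by one state. The longer the delay, the more
episodes are needed before early states receive any credit.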

More recent attempts to tackle credit assignment for delayed and sparse
rewards have been made in Return Decomposition for Delayed Rewards
(RUDDER) [2] and Hindsight Credit Assignment (HCA) [21]. RUDDER aims to
identify actions which increase or decrease the expected future
return. These actions are assigned credit directly by RUDDER, which makes 
learning faster by reducing the delay. We discuss RUDDER in detail in Sect. 2.6. 

Unlike RUDDER, HCA assigns credit by estimating the likelihood of past 
actions having led to the observed outcome and consequently uses hindsight 
information to assign credit to past decisions. Both methods have in common
that the credit assignment problem is framed as a supervised learning task. In
the next section, we look at credit assignment through the lens of
explainability methods.


2.5 Explainability Methods for Credit Assignment


We have established that assigning credit to individual states, actions or state- 
action events along a sequence, which is also known as a trajectory or episode 
in reinforcement learning terminology, can tremendously simplify the task of 
learning an optimal policy. Therefore, if a method is able to determine which 
events were important for a certain outcome, it can be used to study sequences 
generated by a policy. As explainability methods were designed for this purpose, 
we can employ them to assign credit to important events and therefore speed 
up learning. As we have explored in Sect. 2.1, there are several methods we can 
choose from. The choice between intrinsically explainable models and post-hoc 
methods depends on whether a method can be combined with a reinforcement 
learning algorithm and is able to solve the task. In most cases, post-hoc methods 
are preferable, as they do not restrict the learning algorithm and model class. 
Since we are mainly interested in temporal credit assignment, we will look at 
explainability methods with a global scope. Sensitivity analysis methods have 
many drawbacks (see Sect. 2.1) and are therefore not suited for this purpose. 
Thus, we want to use contribution analysis methods. 


2.6 Credit Assignment via Reward Redistribution 


RUDDER [2] demonstrates how contribution analysis methods can be applied 
to target the credit assignment problem. RUDDER redistributes the return 
to relevant events and therefore sets future reward expectations to zero. The 
reward redistribution is achieved through return decomposition, which reduces 
high variance compared to Monte Carlo methods and high biases compared to 
TD methods [2]. This is possible because the state-value estimates are simplified 
to compute averages of immediate rewards. 

In a common reinforcement learning setting, one can assign credit to an
action $a$ when receiving a reward $r$ by updating a policy $\pi(a|s)$ according
to its respective Q-function estimates. However, this fails when rewards are
delayed, since the value network has to average over a large number of
probabilistic future state-action paths, whose number increases exponentially
with the delay of the reward [36,48]. In contrast to using a forward view, a
backward view approach based on a backward analysis of a forward model avoids
problems with unknown future state-action paths, since the sequence is already
completed and known. Backward analysis transforms the forward view approach into
a regression task, at which deep learning methods excel. As a forward model, an
LSTM can be trained to predict the final return, given a sequence of
state-actions. LSTMs have already been used in reinforcement learning [55] for
advantage learning [7] and learning policies [23,24,40]. Using contribution
analysis, RUDDER can decompose the return prediction (the output relevance) into
contributions of single state-action pairs along the observed sequence,
obtaining a redistributed reward (the relevance redistribution). As a result, a
new MDP is created with the same optimal policies and, in the optimal case, with
no delayed rewards (expected future rewards equal zero) [2]. Indeed, for MDPs
the Q-value is equal to the expected immediate reward plus the expected future
rewards. Thus, if the expected future rewards are zero, the Q-value estimation
simplifies to computing the mean of the immediate rewards.

Therefore, in the context of explainable AI, RUDDER uses contribution anal- 
ysis to decompose the return prediction (the output relevance) into contributions 
of single state-action pairs along the observed sequence. RUDDER achieves this 
by training an LSTM model to predict the final return of a sequence of state- 
actions as early as possible. By taking the difference of the predicted returns 
from two consecutive state-actions, the contribution to the final return can be 
inferred [2]. 
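
The difference-of-predictions step can be sketched in a few lines. The return
predictions below are made-up numbers standing in for the output of a trained
return predictor, and the baseline prediction of zero before the first
state-action is an assumption of this sketch.

```python
import numpy as np

def redistribute(predicted_returns):
    """Redistributed reward at step t: g_t - g_{t-1}, with g_{-1} := 0 (assumed)."""
    g = np.asarray(predicted_returns, dtype=float)
    return np.diff(g, prepend=0.0)

# Hypothetical predictions of the final return after each state-action.
g = [0.0, 0.0, 0.8, 0.8, 1.0]
redistributed = redistribute(g)
```

Here the third step receives most of the reward because the prediction jumps
there, which marks it as a key event. The differences telescope, so the
redistributed rewards sum to the final prediction and the return of the
sequence is preserved.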


Sequence-Markov Decision Processes (SDPs). An optimal reward redistribution
should transform a delayed reward MDP into a return-equivalent MDP
with zero expected future rewards. However, given an MDP, setting future
rewards equal to zero is in general not possible. Therefore, RUDDER introduces
sequence-Markov decision processes (SDPs), for which reward distributions are
not required to be Markovian. An SDP is defined as a decision process which
is equipped with a Markov policy and has Markov transition probabilities but
a reward that is not required to be Markovian. Two SDPs $\tilde{\mathcal{P}}$
and $\mathcal{P}$ are return-equivalent if (i) they differ only in their reward
distribution and (ii) they have the same expected return at $t = 0$ for each
policy $\pi$: $\tilde{v}_0^\pi = v_0^\pi$. RUDDER constructs a reward
redistribution that leads to a return-equivalent SDP with a second-order Markov
reward distribution and expected future rewards that are equal to zero. For
these return-equivalent SDPs, Q-value estimation simplifies to computing the
mean.


Return Equivalence. Strictly return-equivalent SDPs $\tilde{\mathcal{P}}$ and
$\mathcal{P}$ can be constructed by reward redistributions. Given an SDP
$\tilde{\mathcal{P}}$, a reward redistribution is a procedure that redistributes
for each sequence $s_0, a_0, \dots, s_T, a_T$ the realization of the
sequence-associated return variable $\tilde{G}_0 = \sum_{t=0}^{T}
\tilde{R}_{t+1}$ or its expectation along the sequence. The reward
redistribution creates a new SDP $\mathcal{P}$ with the redistributed reward
$R_{t+1}$ at time $(t + 1)$ and the return variable $G_0 = \sum_{t=0}^{T}
R_{t+1}$. A reward redistribution is second-order Markov if the redistributed
reward $R_{t+1}$ depends only on $(s_{t-1}, a_{t-1}, s_t, a_t)$. If the SDP
$\mathcal{P}$ is obtained from the SDP $\tilde{\mathcal{P}}$ by reward
redistribution, then $\tilde{\mathcal{P}}$ and $\mathcal{P}$ are strictly
return-equivalent. Theorem 1 in RUDDER states that the optimal policies remain
the same for $\tilde{\mathcal{P}}$ and $\mathcal{P}$ [2].


Reward Redistribution. We consider that a delayed reward MDP $\tilde{\mathcal{P}}$, with a
particular policy $\pi$, can be transformed into a return-equivalent SDP $\mathcal{P}$ with an
optimal reward redistribution and no delayed rewards:


Definition 1 ([2]). For $1 \le t \le T$ and $0 \le m \le T - t$, the expected sum of
delayed rewards at time $(t - 1)$ in the interval $[t + 1, t + m + 1]$ is defined as
$\kappa(m, t - 1) = \mathrm{E}_{\pi}\left[ \sum_{\tau=0}^{m} R_{t+1+\tau} \mid s_{t-1}, a_{t-1} \right]$.


Theorem 2 ([2]). We assume a delayed reward MDP $\tilde{\mathcal{P}}$, where the accumulated
reward is given at sequence end. A new SDP $\mathcal{P}$ is obtained by a second-order
Markov reward redistribution, which ensures that $\mathcal{P}$ is return-equivalent to $\tilde{\mathcal{P}}$.
For a specific $\pi$, the following two statements are equivalent:
(I) $\kappa(T - t - 1, t) = 0$, i.e. the reward redistribution is optimal,
(II) $\mathrm{E}[R_{t+1} \mid s_{t-1}, a_{t-1}, s_t, a_t] = \tilde{q}^{\pi}(s_t, a_t) - \tilde{q}^{\pi}(s_{t-1}, a_{t-1})$.


An optimal reward redistribution fulfills for $1 \le t \le T$ and $0 \le m \le T - t$:
$\kappa(m, t - 1) = 0$.


Theorem 2 shows that an optimal reward redistribution can be obtained by 
a second-order Markov reward redistribution for a given policy. It is an exis- 
tence proof which explicitly gives the expected redistributed reward. In addi- 
tion, higher-order Markov reward redistributions can also be optimal. In case 
of higher-order Markov reward redistribution, Equation (II) in Theorem 2 can 
have random variables $R_{t+1}$ that depend on arbitrary states that are visited in
the trajectory. Then Equation (II) averages out all states except $s_t$ and $s_{t-1}$ and
averages out all randomness. In particular, this is also interesting for Align- 
RUDDER, since it can achieve an optimal reward redistribution. Therefore, 
although Align-RUDDER is in general not second-order Markov, Theorem 2 
still holds in case of optimality. 

For RUDDER, reward redistribution as in Theorem 2 can be achieved 
through return decomposition by predicting $\tilde{r}_{T+1} \in \tilde{R}_{T+1}$ of the original MDP
$\tilde{\mathcal{P}}$ by a function $g$ from the state-action sequence. RUDDER determines for each
sequence element its contribution to the prediction of $\tilde{r}_{T+1}$ at the end of the
sequence. Therefore, it performs backward analysis through contribution analy- 
sis. Contribution analysis computes the contribution of the current input to the 
final prediction, i.e. the information gain by the current input on the final predic- 
tion. In principle, RUDDER could use any contribution analysis method. How- 
ever, RUDDER prefers three methods: (A) differences of return predictions, (B) 
integrated gradients (IG) [64], and (C) layer-wise relevance propagation (LRP) 
[5]. For contribution method (A), RUDDER ensures that g predicts the final 
reward $\tilde{r}_{T+1}$ at every time step. Hence, the change in prediction is a measure
of the contribution of an input to the final prediction and assesses the informa- 
tion gain by this input. The redistributed reward is given by the difference of 
consecutive predictions. In contrast to method (A), methods (B) and (C) use 
information from later on in the sequence for determining the contribution of 
the current input. Thus, a non-Markovian reward is introduced, as it depends on 
later sequence elements. However, the non-Markovian reward must be viewed as
a probabilistic reward, which is prone to high variance. Therefore, RUDDER
prefers method (A). 
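
Method (A) can be illustrated with a minimal sketch. The prediction values below are hypothetical, and setting the prediction for the empty prefix to zero is an assumption of this sketch rather than something stated in the text. Because the redistributed rewards are differences of consecutive predictions, they telescope and sum to the final prediction.

```python
from typing import List, Sequence


def redistribute_by_differences(predictions: Sequence[float],
                                initial_prediction: float = 0.0) -> List[float]:
    """Redistributed rewards R_{t+1} = g_t - g_{t-1} for t = 0..T.

    predictions[t] stands for g((s, a)_{0:t}), the return predicted after the
    first t + 1 state-action pairs. initial_prediction is the value used for
    the empty prefix (an assumption of this sketch).
    """
    rewards = []
    previous = initial_prediction
    for current in predictions:
        rewards.append(current - previous)
        previous = current
    return rewards


# Hypothetical predictions for a key-door-treasure episode: the prediction
# jumps when the key is taken (t = 2) and when the door is opened (t = 5).
predictions = [0.0, 0.0, 0.5, 0.5, 0.5, 1.0, 1.0]
rewards = redistribute_by_differences(predictions)
print(rewards)  # [0.0, 0.0, 0.5, 0.0, 0.0, 0.5, 0.0]

# The differences telescope, so their sum equals the last prediction.
assert abs(sum(rewards) - predictions[-1]) < 1e-12
```

Only the two time steps at which the prediction changes receive a non-zero reward, which is the behavior the step-function argument in the following paragraph relies on.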

A principal insight on which RUDDER is based is that the Q-function of
optimal policies for complex tasks resembles a step function, as such tasks are
hierarchical and composed of sub-tasks (blue curve, row 1 of Fig. 2, right panel).
Completing such a sub-task is then reflected by a step in the Q-function. Therefore,
a step in the Q-function is a change in return expectation, that is, the expected 
amount of the return or the probability to obtain the return changes. With 
return decomposition one identifies the steps of the Q-function (green arrows 
in Fig. 2, right panel), and an LSTM can therefore predict the expected return 
(red arrow, row 1 of Fig. 2, right panel), given the state-action sub-sequence to

Fig. 2. Basic insight into reward redistribution [45]. Left panel, Row 1: An
agent has to take a key to unlock a door. Both events increase the probability of
receiving the treasure, which the agent always gets as a delayed reward, when the
door is unlocked at sequence end. Row 2: The Q-function approximation typically
predicts the expected return at every state-action pair (red arrows). Row 3: However, 
the Q-function approximation requires only to predict the steps (red arrows). Right 
panel, Row 1: The Q-function is the future-expected return (blue curve). Green 
arrows indicate Q-function steps and the big red arrow the delayed reward at sequence 
end. Row 2 and 3: The redistributed rewards correspond to steps in the Q-function 
(small red arrows). Row 4: After redistributing the reward, only the redistributed 
immediate reward remains (red arrows). Reward is no longer delayed. (Color figure 
online) 


redistribute the reward. The prediction is decomposed into single steps of the 
Q-function (green arrows in Fig. 2). The redistributed rewards (small red arrows 
in the second and third rows of the right panel of Fig. 2) remove the steps. Thus, the
expected future reward is equal to zero (blue curve at zero in the last row of the right
panel of Fig. 2). Future rewards of zero mean that learning the Q-values simplifies
to estimating the expected immediate rewards (small red arrows in right
panel of Fig. 2), since delayed rewards are no longer present. Also, Hindsight 
Credit Assignment [21] identifies such Q-function steps that stem from actions 
alone. Figure 2 further illustrates how a Q-function predicts the expected return 
from every state-action pair, and how it is prone to prediction errors that ham- 
per learning (second row, left panel). Since the Q-function is mostly constant, 
it is not necessary to predict the expected return for every state-action pair. It 
is sufficient to identify relevant state-actions across the whole episode and use 
them for predicting the expected return. This is achieved by computing the dif- 
ference of two subsequent predictions of the LSTM model. If a state-action pair 
increases the prediction of the return, it is immediately rewarded. Using state-action
sub-sequences $(s,a)_{0:t} = (s_0, a_0, \ldots, s_t, a_t)$, the redistributed reward is
$R_{t+1} = g((s,a)_{0:t}) - g((s,a)_{0:t-1})$, where $g$ is the return decomposition function,
which is represented by an LSTM model and predicts the return of the
episode. The LSTM model first learns to approximate the largest steps of the Q- 
function, since they reduce the prediction error the most. Therefore, the LSTM 
model first extracts the relevant state-action pairs (events). Furthermore, the
LSTM network [29-32] can store the relevant state-actions in its memory cells 
and subsequently, only updates its states to change its return prediction, when a 
new relevant state-action pair is observed. Thus, the LSTM return prediction is 
constant at most time points and does not have to be learned. The basic insight
that Q-functions are step functions is the motivation for identifying these steps
via return decomposition to speed up learning through reward redistribution, 
and furthermore enhance explainability through its state-action contributions. 
In conclusion, redistributed reward serves as reward for a subsequent learning 
method [2]: (A) The Q-values can be directly estimated [2], which is also shown 
in Sect. 3 for the artificial tasks and Behavioral Cloning (BC) [70] pre-training 
for the Minecraft environment [19]. (B) The redistributed rewards can serve for 
learning with policy gradients like Proximal Policy Optimization (PPO) [57], 
which is also used in the Minecraft experiments for full training. (C) The redis- 
tributed rewards can serve for temporal difference learning, like Q-learning [74]. 
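
Learning method (A), direct Q-value estimation, can be sketched for the tabular case as follows. With an optimal redistribution the expected future reward is zero, so the Q-value of a state-action pair reduces to its expected immediate redistributed reward, which can be estimated by a sample mean. The state names, actions, and reward values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

# One step of an episode: (state, action, redistributed reward R_{t+1}).
Step = Tuple[Hashable, Hashable, float]


def estimate_q_values(episodes: List[List[Step]]
                      ) -> Dict[Tuple[Hashable, Hashable], float]:
    """Average the redistributed immediate reward seen for each (state, action)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        for state, action, reward in episode:
            totals[(state, action)] += reward
            counts[(state, action)] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Two hypothetical episodes with already redistributed rewards.
episodes = [
    [("start", "right", 0.0), ("key", "take", 0.5), ("door", "open", 0.5)],
    [("start", "right", 0.0), ("key", "take", 0.25), ("door", "open", 0.75)],
]
q_values = estimate_q_values(episodes)
print(q_values[("key", "take")])   # 0.375
print(q_values[("door", "open")])  # 0.625
```

No bootstrapping from successor states is involved, which is why this estimate does not suffer from the slow propagation of delayed rewards.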


3 Strategy Extraction via Reward Redistribution 


A strategy is a sequence of events which leads to a desirable outcome. Assum- 
ing a sequence of events is provided, the extraction of a strategy is the process 
of extracting events which are important for the desired outcome. This outcome 
could be a common state or return achieved at the end of the sequences. For example,
if the desired outcome is to construct a wooden pickaxe in Minecraft, a strategy
extracted from human demonstrations might contain event sequences for collecting
a log, making planks, crafting a crafting table and finally a wooden pickaxe.

Strategy extraction is useful to study policies and also demonstration 
sequences. High return episodes can be studied to extract a strategy achiev- 
ing such high returns. For example, Minecraft episodes where a stone pickaxe is 
obtained will include a strategy to make a wooden pickaxe, followed by collect- 
ing stones and finally the stone pickaxe. Similarly, strategies can be extracted 
from low return episodes, which can be helpful in learning which events to avoid. 
Extracted strategies explain the behavior of underlying policies or demonstra- 
tions. Furthermore, by comparing new trajectories to a strategy obtained from 
high return episodes, the reward signal can be redistributed to those events that 
are necessary for following the strategy and therefore are important. 

However, current exploration strategies struggle with discovering episodes 
with high rewards in complex environments with delayed rewards. Therefore, 
episodes with high rewards are assumed and are given as demonstrations, such 
that they do not have to be discovered by exploration. Unfortunately, the number 
of demonstrations is typically small, as obtaining them is often costly and time- 
consuming. Therefore, deep learning methods that require a large amount of 
data, such as RUDDER’s LSTM model, will not work well for this task while 
Align-RUDDER can learn a good strategy from as few as two demonstrations. 

Reward redistribution identifies events which lead to an increase (or decrease) 
in expected return. The sequence of important events is the strategy. Thus, 
reward redistribution can be used to extract strategies. We illustrate this on the 
example of profile models in Sect. 3.1. Furthermore, a strategy can be used to
redistribute reward by comparing a new sequence to an already given strategy. 
This results in faster learning, and is explained in detail in Sect. 3.2. Finally, we 
study expert episodes for the complex task of mining a diamond in Minecraft in 
Sect. 4.2.


[Figure 3 panel labels: conservation score; reward redistribution]


Fig. 3. The function of a protein is largely determined by its structure [45]. The relevant 
regions of this structure are even conserved across organisms, as shown in the left panel. 
Similarly, solving a task can often be decomposed into sub-tasks which are conserved 
across multiple demonstrations. This is shown in the right panel, where events are 
mapped to the letter code for amino acids. Sequence alignment makes those conserved 
regions visible and enables redistribution of reward to important events. 


3.1 Strategy Extraction with Profile Models 


Align-RUDDER introduced techniques from sequence alignment to replace the 
LSTM model from RUDDER by a profile model for reward redistribution. The 
profile model is the result of a multiple sequence alignment of the demonstrations
and allows aligning new sequences to it. Both the sub-sequences $(s,a)_{0:t-1}$
and $(s,a)_{0:t}$ are mapped to sequences of events and are then aligned to the
profile model. Thus, both sequences receive an alignment score $S$, which is
proportional to the return decomposition function $g$. Similar to the LSTM
model, Align-RUDDER identifies the largest steps in the Q-function via relevant
events determined by the profile model. The redistributed reward is again
$R_{t+1} = g((s,a)_{0:t}) - g((s,a)_{0:t-1})$ (see Eq. (3)). Therefore, redistributing the
reward by sequence alignment fits into the RUDDER framework with all its the- 
oretical guarantees. RUDDER is valid and works if its LSTM is replaced by other 
recurrent networks, attention mechanisms, or, as in case of Align-RUDDER, 
sequence and profile models [2]. 


Reward Redistribution by Sequence Alignment. In bioinformatics, 
sequence alignment identifies similarities between biological sequences to deter- 
mine their evolutionary relationship [44,62]. The result of the alignment of mul- 
tiple sequences is a profile model. The profile model is a consensus sequence, 
a frequency matrix, or a Position-Specific Scoring Matrix (PSSM) [63]. New 
sequences can be aligned to a profile model and receive an alignment score that 
indicates how well the new sequences agree to the profile model. 
Align-RUDDER uses such alignment techniques to align two or more high 
return demonstrations. For the alignment, Align-RUDDER assumes that the 
demonstrations follow the same underlying strategy, therefore they are similar 
to each other, analogous to being evolutionarily related. Figure 3 shows an alignment
of biological sequences and an alignment of demonstrations where events
are mapped to letters. If the agent generates a state-action sequence $(s,a)_{0:t-1}$,
then this sequence is aligned to the profile model $g$, giving a score $g((s,a)_{0:t-1})$.
The next action of the agent extends the state-action sequence by one state-action
pair $(s_t, a_t)$. The extended sequence $(s,a)_{0:t}$ is also aligned to the profile
model $g$, giving another score $g((s,a)_{0:t})$. The redistributed reward $R_{t+1}$ is the
difference of these scores: $R_{t+1} = g((s,a)_{0:t}) - g((s,a)_{0:t-1})$ (see Eq. (3)). This
difference indicates how much of the return is gained or lost by adding another
sequence element. Align-RUDDER scores how closely an agent follows an underlying
strategy, which has been extracted by the profile model.

The new reward redistribution approach consists of five steps, see Fig. 4: 
(I) Define events to turn episodes of state-action sequences into sequences of 
events. (II) Determine an alignment scoring scheme, so that relevant events are 
aligned to each other. (III) Perform a multiple sequence alignment (MSA) of the 
demonstrations. (IV) Compute the profile model like a PSSM. (V) Redistribute 
the reward: Each sub-sequence $\tau_t$ of a new episode $\tau$ is aligned to the profile.
The redistributed reward $R_{t+1}$ is proportional to the difference of scores $S$ based
on the PSSM given in step (IV), i.e. $R_{t+1} \propto S(\tau_t) - S(\tau_{t-1})$.


[Figure 4 panels: (I) Defining Events; (II) Scoring Matrix; (III) Multiple Sequence Alignment; (IV) PSSM and Profile Model; (V) Reward Redistribution, $R_{t+1} = (S(\tau_t) - S(\tau_{t-1}))\, C$]


Fig. 4. The five steps of Align-RUDDER’s reward redistribution [45]. (I) Define events 
and turn demonstrations into sequences of events. Each block represents an event to
which the original state is mapped. (II) Construct a scoring matrix using event proba- 
bilities from demonstrations for diagonal elements and setting off-diagonal to a constant 
value. (III) Perform an MSA of the demonstrations. (IV) Compute a PSSM. Events 
with the highest column scores are indicated at the top row. (V) Redistribute reward 
as the difference of scores of sub-sequences aligned to the profile. 


In the following, the five steps of Align-RUDDER’s reward redistribution are 
explained in detail. 


(I) Defining Events. Align-RUDDER considers differences of consecutive 
states to detect a change caused by an important event like achieving a
sub-task¹. An event is defined as a cluster of state differences, where
similarity-based clustering like affinity propagation (AP) [18] is used. If states are only
enumerated, it is suggested to use the “successor representation” [12] or “suc- 
cessor features” [9]. In Align-RUDDER, the demonstrations are combined with 
state-action sequences generated by a random policy to construct the successor 
representation. 


1 Any sequence of events can be used for clustering and reward redistribution, and 
consequently for sub-task extraction.

A sequence of events is obtained from a state-action sequence by mapping
each state s to its cluster identifier e (the event) and ignoring the actions. Alignment
techniques from bioinformatics assume sequences composed of a few events, e.g. 
20 events. If there are too many events, good fitting alignments cannot be dis- 
tinguished from random alignments. This effect is known in bioinformatics as 
“Inconsistency of Maximum Parsimony” [16]. 
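
The mapping from states to events can be sketched as follows. This is a simplified stand-in: instead of running affinity propagation, it assumes that cluster centroids of state differences are already available and assigns each difference to its nearest centroid. The inventory vectors, centroid values, and event letters are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def state_differences(states: List[Vector]) -> List[List[float]]:
    """Differences of consecutive states, one per transition."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(states, states[1:])]


def squared_distance(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def to_events(states: List[Vector], centroids: Dict[str, Vector]) -> List[str]:
    """Label every state difference with the identifier of its nearest centroid."""
    return [min(centroids, key=lambda name: squared_distance(diff, centroids[name]))
            for diff in state_differences(states)]


# Hypothetical inventory states [logs, planks] along one episode.
states = [[0, 0], [1, 0], [1, 0], [0, 4]]
# Hypothetical centroids; in practice they would come from a clustering step.
centroids = {
    "L": [1, 0],   # gained a log
    "N": [0, 0],   # nothing changed
    "P": [-1, 4],  # turned a log into planks
}
events = to_events(states, centroids)
print(events)  # ['L', 'N', 'P']
```

The actions play no role in the mapping, and the resulting alphabet stays small, which matches the requirement of the alignment step.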


(II) Determining the Alignment Scoring System. A scoring matrix $\mathbb{S}$ with
entries $s_{i,j}$ determines the score for aligning event $i$ with $j$. A priori, we only
know that a relevant event should be aligned to itself but not to other events.
Therefore, we set $s_{i,j} = 1/p_i$ for $i = j$ and $s_{i,j} = \alpha$ for $i \neq j$. Here, $p_i$ is the
relative frequency of event $i$ in the demonstrations, and $\alpha$ is a hyperparameter, which
is typically a small negative number. This scoring scheme encourages alignment
of rare events, for which $p_i$ is small.
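
A minimal sketch of this scoring scheme, assuming demonstrations are given as strings with one letter per event. The event letters and the value chosen for the off-diagonal constant are illustrative only.

```python
from collections import Counter
from typing import Dict, List, Tuple


def scoring_matrix(demonstrations: List[str],
                   alpha: float = -1.0) -> Dict[Tuple[str, str], float]:
    """Diagonal entries 1/p_i (inverse relative frequency), off-diagonal alpha."""
    counts = Counter(event for demo in demonstrations for event in demo)
    total = sum(counts.values())
    events = sorted(counts)
    matrix = {}
    for i in events:
        for j in events:
            # 1 / p_i with p_i = counts[i] / total
            matrix[(i, j)] = total / counts[i] if i == j else alpha
    return matrix


# Hypothetical demonstrations: event A is frequent, event B is rare.
demonstrations = ["AAB", "AAB"]
scores = scoring_matrix(demonstrations, alpha=-1.0)
print(scores[("A", "A")])  # 1.5
print(scores[("B", "B")])  # 3.0
print(scores[("A", "B")])  # -1.0
```

The rare event B receives twice the self-alignment score of the frequent event A, so an alignment algorithm is pushed to line up the rare events first.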


(III) Multiple Sequence Alignment (MSA). An MSA algorithm maximizes
the sum of all pairwise scores $S_{\mathrm{MSA}} = \sum_{i,j,\, i<j} \sum_{t=0}^{L} s_{i,j,t_i,t_j,t}$ in an alignment,
where $s_{i,j,t_i,t_j,t}$ is the score at alignment column $t$ for aligning the event at position
$t_i$ in sequence $i$ to the event at position $t_j$ in sequence $j$. $L \ge T$ is the
alignment length, since gaps make the alignment longer than the length of each
sequence. Align-RUDDER uses ClustalW [69] for MSA. MSA constructs a guid- 
ing tree by agglomerative hierarchical clustering of pairwise alignments between 
all demonstrations. This guiding tree allows identifying multiple strategies. 
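
The objective above can be evaluated for a given alignment with a few lines of code. The sketch below only computes the sum-of-pairs score; it does not search for the best alignment, which is what an MSA tool such as ClustalW does heuristically. The gap symbol, the scoring values, and the treatment of gapped columns as a constant score are assumptions of this sketch.

```python
from itertools import combinations
from typing import Dict, List, Tuple

GAP = "-"


def sum_of_pairs_score(alignment: List[str],
                       scores: Dict[Tuple[str, str], float],
                       gap_score: float = 0.0) -> float:
    """Sum the pairwise column scores over all pairs of aligned sequences."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0.0
    for row_i, row_j in combinations(alignment, 2):
        for a, b in zip(row_i, row_j):
            if a == GAP or b == GAP:
                total += gap_score      # column with a gap in one of the two rows
            else:
                total += scores[(a, b)]
    return total


# Hypothetical scoring matrix: rare event B scores higher than frequent event A.
scores = {("A", "A"): 1.5, ("B", "B"): 3.0, ("A", "B"): -1.0, ("B", "A"): -1.0}

good = ["AAB", "A-B", "AAB"]   # the B events share one column
bad = ["AAB", "AB-", "AAB"]    # the B of the second row is misaligned
print(sum_of_pairs_score(good, scores))  # 15.0
print(sum_of_pairs_score(bad, scores))   # 7.0
```

The alignment that places the rare events in the same column scores higher, which is the effect the inverse-frequency diagonal is meant to produce.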


(IV) Position-Specific Scoring Matrix (PSSM) and MSA Profile 
Model. From the alignment, Align-RUDDER constructs a profile model as a) 
column-wise event probabilities and b) a PSSM [63]. The PSSM is a column-wise 
scoring matrix to align new sequences to the profile model. 
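
The two forms of the profile model can be sketched as follows. The text does not specify how the PSSM entries are computed from the column-wise probabilities; this sketch assumes the log-odds form common in bioinformatics, with a pseudocount and with background frequencies taken from the alignment itself. Both choices are assumptions, as are the example sequences.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

GAP = "-"


def profile_and_pssm(alignment: List[str], pseudocount: float = 1.0
                     ) -> Tuple[List[Dict[str, float]], List[Dict[str, float]]]:
    """Return (column-wise event probabilities, log-odds PSSM) for an alignment."""
    background_counts = Counter(e for row in alignment for e in row if e != GAP)
    events = sorted(background_counts)
    n_total = sum(background_counts.values())
    background = {e: background_counts[e] / n_total for e in events}

    profile, pssm = [], []
    for column in zip(*alignment):
        counts = Counter(e for e in column if e != GAP)
        denom = sum(counts.values()) + pseudocount * len(events)
        # Smoothed probability of each event in this alignment column.
        probs = {e: (counts[e] + pseudocount) / denom for e in events}
        profile.append(probs)
        # Log-odds of the column probability against the background frequency.
        pssm.append({e: math.log(probs[e] / background[e]) for e in events})
    return profile, pssm


alignment = ["AAB", "A-B", "AAB"]
profile, pssm = profile_and_pssm(alignment)
for column_scores in pssm:
    print({e: round(s, 2) for e, s in column_scores.items()})
```

Positive entries mark events that are over-represented at a position relative to their overall frequency, so the last column favors B and the first two favor A.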


(V) Reward Redistribution. The reward redistribution is based on the profile
model. A sequence $\tau = e_{0:T}$ ($e_t$ is the event at position $t$) is aligned to the profile,
which gives the score $S(\tau) = \sum_{l=0}^{L} s_{l,t_l}$. Here, $s_{l,t_l}$ is the alignment score for the
event $e_{t_l}$ at position $l$ in the alignment. Alignment gaps are columns to which
no event was aligned, which have $t_l = T + 1$ with gap penalty $s_{l,T+1}$. If $\tau_t = e_{0:t}$
is the prefix sequence of $\tau$ of length $t + 1$, then the reward redistribution $R_{t+1}$
for $0 \le t \le T$ is

$$R_{t+1} = \left( S(\tau_t) - S(\tau_{t-1}) \right) C = g((s,a)_{0:t}) - g((s,a)_{0:t-1}), \qquad (3)$$

$$R_{T+2} = \tilde{G}_0 - \sum_{t=0}^{T} R_{t+1},$$

where $C = \mathrm{E}_{\mathrm{demo}}[\tilde{G}_0] \,/\, \mathrm{E}_{\mathrm{demo}}\left[ \sum_{t=0}^{T} \left( S(\tau_t) - S(\tau_{t-1}) \right) \right]$ with $S(\tau_{-1}) = 0$. The
original return of the sequence $\tau$ is $\tilde{G}_0 = \sum_{t=0}^{T} \tilde{R}_{t+1}$, and the expectation of
the return over demonstrations is $\mathrm{E}_{\mathrm{demo}}$. The constant $C$ scales $R_{t+1}$ to the
range of $\tilde{G}_0$. $R_{T+2}$ is the correction of the redistributed reward [2], with zero
expectation for demonstrations: $\mathrm{E}_{\mathrm{demo}}[R_{T+2}] = 0$. Since $\tau_t = e_{0:t}$ and $e_t =
f(s_t, a_t)$, we can set $g((s,a)_{0:t}) = S(\tau_t)\, C$. Strict return-equivalence [2] is ensured by
$G_0 = \sum_{t=0}^{T+1} R_{t+1} = \tilde{G}_0$. The redistributed reward depends only on the past:
$R_{t+1} = h((s,a)_{0:t})$.
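
Step (V) can be sketched with a small dynamic program that aligns each prefix of an event sequence to the profile and takes differences of consecutive scores. The global alignment recursion, the zero default gap penalty, the zero score for events missing from a PSSM column, and all numerical values are assumptions of this sketch, not details given in the text.

```python
from typing import Dict, List, Sequence

Profile = List[Dict[str, float]]  # one score dictionary per alignment column


def alignment_score(events: Sequence[str], pssm: Profile,
                    gap_penalty: float = 0.0) -> float:
    """Best score S for aligning `events`, in order, to the profile columns."""
    n_cols, n_events = len(pssm), len(events)
    # best[l][k]: best score using the first l columns and the first k events.
    best = [[0.0] * (n_events + 1) for _ in range(n_cols + 1)]
    for l in range(1, n_cols + 1):
        best[l][0] = best[l - 1][0] + gap_penalty
    for k in range(1, n_events + 1):
        best[0][k] = best[0][k - 1] + gap_penalty
    for l in range(1, n_cols + 1):
        for k in range(1, n_events + 1):
            match = best[l - 1][k - 1] + pssm[l - 1].get(events[k - 1], 0.0)
            skip_column = best[l - 1][k] + gap_penalty
            skip_event = best[l][k - 1] + gap_penalty
            best[l][k] = max(match, skip_column, skip_event)
    return best[n_cols][n_events]


def redistribute(events: Sequence[str], pssm: Profile, scale: float) -> List[float]:
    """R_{t+1} = (S(tau_t) - S(tau_{t-1})) * C, with S(tau_{-1}) = 0."""
    rewards, previous = [], 0.0
    for t in range(len(events)):
        current = alignment_score(events[: t + 1], pssm)
        rewards.append((current - previous) * scale)
        previous = current
    return rewards


# Hypothetical two-column PSSM: the strategy is "first A, then B".
pssm = [{"A": 1.0, "B": -1.0}, {"A": -1.0, "B": 2.0}]
events = ["A", "X", "B"]       # X is an event that is irrelevant to the strategy
rewards = redistribute(events, pssm, scale=0.5)
print(rewards)                 # [0.5, 0.0, 1.0]

original_return = 1.5          # hypothetical delayed return of the episode
correction = original_return - sum(rewards)   # R_{T+2}
print(correction)              # 0.0
```

Each reward depends only on the prefix seen so far, and adding the correction term makes the redistributed rewards sum to the original return.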


Higher-Order Markov Reward Redistribution. Align-RUDDER may lead 
to higher-order Markov redistribution. However, Corollary 1 in the Appendix of 
[45] states that the optimality criterion from Theorem 2 in Arjona-Medina et 
al. [2] also holds for higher-order Markov reward redistribution, if the expected 
redistributed higher-order Markov reward is the difference of Q-values. In that 
case, the redistribution is optimal, and there is no delayed reward. Furthermore, 
the optimal policies are the same as for the original problem. This corollary 
is the motivation for redistributing the reward to the steps in the Q-function.
Furthermore, Corollary 2 in the Appendix of [45] states that under a condition, 
an optimal higher-order reward redistribution can be expressed as the difference 
of Q-values. 


3.2 Explainable Agent Behavior via Strategy Extraction 


The reward redistribution identifies sub-tasks as alignment positions with high 
redistributed rewards. These sub-tasks are indicated by high scores $s$ in the
PSSM. Reward redistribution also determines the terminal states of sub-tasks, 
since it assigns rewards for solving the sub-tasks. As such, the strategy for solving 
a given task is extracted from those demonstrations used for alignment and 
represented as a sequence of sub-tasks. By assigning rewards to these sub-tasks 
with Align-RUDDER, a policy can be learned that is also able to achieve these 
sub-tasks and therefore high returns. 

While RUDDER with an LSTM model for reward redistribution is also 
able to assign reward to important events, in practice it is not easy to iden- 
tify sub-tasks. Changes in predicted reward from one event to the next are 
often small, as it is difficult for an LSTM model to learn sharp increases or 
decreases. Furthermore, it would be necessary to inspect a relatively large num- 
ber of episodes to identify common sub-tasks. In contrast, the sub-tasks extracted 
via sequence alignment are often easy to interpret and can be obtained from only 
a few episodes. The strategy of agents trained via Align-RUDDER can easily be 
explained by inspecting the alignment and visualizing the sequence of aligned 
events. As the strategy represents the global long-term behavior of an agent, its 
behavior can be interpreted through the strategy. 
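
Reading a strategy off an alignment can be sketched as a consensus extraction: a column contributes a sub-task if one event occurs in a sufficiently large fraction of the demonstrations. The threshold rule and the event letters (L for log, P for planks, T for crafting table, W for wooden pickaxe) are assumptions of this sketch and not the procedure used in the experiments.

```python
from collections import Counter
from typing import List

GAP = "-"


def extract_strategy(alignment: List[str], min_fraction: float = 0.8) -> List[str]:
    """Keep the most frequent event of every column that is conserved enough."""
    n_sequences = len(alignment)
    strategy = []
    for column in zip(*alignment):
        counts = Counter(e for e in column if e != GAP)
        if not counts:
            continue  # column consisting only of gaps
        event, count = counts.most_common(1)[0]
        if count / n_sequences >= min_fraction:
            strategy.append(event)
    return strategy


# Hypothetical alignment of three demonstrations.
alignment = ["LLPTW", "L-PTW", "LLP-W"]
print(extract_strategy(alignment))                    # ['L', 'P', 'W']
print(extract_strategy(alignment, min_fraction=0.6))  # ['L', 'L', 'P', 'T', 'W']
```

Raising the threshold keeps only the events shared by all demonstrations, while lowering it also reveals steps that some experts skipped.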


4 Experiments 


Using several examples we show how reward redistribution with Align-RUDDER 
enables learning a policy with only a few demonstrations, even in highly complex 
environments. Furthermore, the strategy these policies follow is visualized, high- 
lighting the ability of Align-RUDDER’s alignment-based approach to interpret 
agent behavior.


4.1 Gridworld


First, we analyze Align-RUDDER on two artificial tasks. The tasks are variations 
of the gridworld rooms example [68], where cells (locations) are the MDP states. 
The FourRooms environment is a 12 x 12 gridworld with four rooms. The target 
is in room four, and the start is in room one (from bottom left, to bottom right) 
with 20 portal entry locations. EightRooms is a larger variant with a 12 x 24 
gridworld divided into eight rooms. Here, the target is in room eight, and the 
starting location in room one, again with 20 portal entry locations. We show the 
two artificial tasks with sample trajectories in Fig. 5. 


Fig. 5. Examples of trajectories in the two artificial task environments with four (left) 
and eight (right) rooms. The initial position is indicated in red, the portal between the 
first and second room in yellow and the goal in green [45]. Blue squares indicate the 
path of the trajectory. (Color figure online) 


In this setting, the states do not have to be time-aware for ensuring stationary 
optimal policies but the unobserved used-up time introduces a random effect. 
The grid is divided into rooms. The agent’s goal is to reach a target from an 
initial state with the lowest number of steps. It has to cross different rooms, 
which are connected by doors, except for the first room, which is only connected 
to the second room by a portal. If the agent is at the portal entry cell of the 
first room, then it is teleported to a fixed portal arrival cell in the second room. 
The location of the portal entry cell is random for each episode, while the portal 
arrival cell is fixed across episodes. The portal entry cell location is given in the 
state for the first room. The portal is introduced to ensure that initialization with 
behavioral cloning (BC) alone is not sufficient for solving the task. It enforces 
that going to the portal entry cells is learned, even when they are at positions 
not observed in demonstrations. At every location, the agent can move up, down, 
left, right. The state transitions are stochastic. An episode ends after T = 200 
time steps. If the agent arrives at the target, then at the next step it goes into an 
absorbing state, where it stays until T = 200 without receiving further rewards. 
Reward is only given at the end of the episode. Demonstrations are generated 
by an optimal policy with an exploration rate of 0.2. 

The five steps of Align-RUDDER’s reward redistribution for these experiments are:

(i) Defining Events. Events are clusters of states obtained by Affinity
Propagation using the successor representation based on demonstrations as
similarity. Figure 6 shows examples of clusters for the two versions of the
environment.

(ii) Determining the Alignment Scoring System. The scoring matrix
is obtained according to (II), using $\epsilon = 0$ and setting all off-diagonal values
of the scoring matrix to $-1$.

(iii) Multiple Sequence Alignment (MSA). ClustalW is used for the MSA
of the demonstrations with zero gap penalties and no biological options.

(iv) Position-Specific Scoring Matrix (PSSM) and MSA Profile
Model. The MSA supplies a profile model and a PSSM, as in (IV).

(v) Reward Redistribution. Sequences generated by the agent are mapped
to sequences of events according to (I). Reward is redistributed via differences
of profile alignment scores of consecutive sub-sequences according to Eq. (3)
using the PSSM.


Fig. 6. Examples of different clusters in the FourRooms (left) and EightRooms (right) 
environment with 1% stochasticity on the transitions after performing clustering with 
Affinity Propagation using the successor representation with 25 demonstrations. Dif- 
ferent colors represent different clusters [45]. 


The reward redistribution determines sub-tasks like doors or portal arrival. 
Some examples are shown in Fig. 7. In these cases, three sub-tasks emerged: one
for entering the portal and going to the first room, one for travelling from the
entrance of one room to the exit of the next room, and finally one for going to the goal in
the last room. The sub-tasks partition the Q-table into sub-tables, each of which represents
a sub-agent. The emerging set of sub-agents describes the global behavior of the
Align-RUDDER method and can be directly used to explain the decision-making 
for specific tasks.


Fig. 7. Reward redistribution for the above trajectories in the FourRooms (left) and
EightRooms (right) environments [45]. Here, sub-tasks emerged via reward redistribution
for entering the portal, travelling from the entrance of one room to the exit of the
next and finally for reaching the goal.


Results. In addition to enabling an interpretation of the strategy for solving a 
task, the redistributed reward signal speeds up the learning process of existing 
methods and requires fewer examples when compared to related approaches. All 
compared methods learn a Q-table and use an $\epsilon$-greedy policy with $\epsilon = 0.2$. The
Q-table is initialized by behavioral cloning (BC). The state-action pairs which 
are not initialized, since they are not visited in the demonstrations, get an ini- 
tialization by drawing a sample from a normal distribution with mean 1 and 
standard deviation 0.5 (avoiding equal Q-values). Align-RUDDER learns the Q- 
table via RUDDER’s Q-value estimation (learning method (A) from above). For 
BC+Q, RUDDER (LSTM), SQIL [49], and DQfD [26] a Q-table is learned by
Q-learning. Hyperparameters are selected via grid search with a similar com- 
putational budget for each method. For different numbers of demonstrations, 
performance is measured by the number of episodes to achieve 80% of the aver- 
age return of the demonstrations. A Wilcoxon rank-sum test determines the 
significance of performance differences between Align-RUDDER and the other 
methods.


Fig. 8. Comparison of Align-RUDDER and other methods in the FourRooms (left)
and EightRooms (right) environments with respect to the number of episodes required
for learning on different numbers of demonstrations. Results are the average over 100
trials. Align-RUDDER significantly outperforms all other methods [45].


Figure 8 shows the number of episodes required for achieving 80% of the
average return of the demonstrations for different numbers of demonstrations. In
both environments, Align-RUDDER significantly outperforms all other methods
for $\le 10$ demonstrations (with p-values of $< 10^{-10}$ and $< 10^{-19}$ for Task (I) and
(II), respectively).
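
The kind of significance test used for these comparisons can be sketched in a few lines. The version below is a two-sided Wilcoxon rank-sum test with the normal approximation, average ranks for ties, and no tie correction of the variance; whether the reported p-values were computed in exactly this way is not stated in the text. The sample values are made up for illustration.

```python
import math
from typing import Sequence, Tuple


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (z statistic, two-sided p-value) of the Wilcoxon rank-sum test."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2.0
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value


# Made-up numbers of episodes needed by two methods over eight trials each.
method_a = [10, 12, 11, 13, 9, 14, 10.5, 12.5]
method_b = [30, 28, 35, 31, 29, 33, 32, 34]
z, p_value = rank_sum_test(method_a, method_b)
print(round(z, 2), p_value < 0.01)
```

A negative z statistic indicates that the first sample tends to have smaller values, here meaning fewer episodes until the performance threshold is reached.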


4.2 Minecraft 


To demonstrate the effectiveness of Align-RUDDER even in highly complex envi- 
ronments, it was applied to the complex high-dimensional problem of obtaining 
a diamond in Minecraft with the MineRL environment [19]. This task requires 
an agent to collect a diamond by exploring the environment, gathering resources 
and building necessary tools. To obtain a diamond the agent needs to collect 
resources (log, cobblestone, etc.) and craft tools (table, pickaxe, etc.). Every 
episode of the environment is procedurally generated, and the agent is placed at 
a random location. This is a challenging environment for reinforcement learning 
as episodes are typically very long, the reward signal is sparse and exploration 
difficult. By using demonstrations from human players, Align-RUDDER can cir- 
cumvent the exploration problem and with reward redistribution can ameliorate 
the sparse reward problem. Furthermore, by identifying sub-tasks, individual 
agents can be trained to solve simpler tasks, and help divide the complex long 
time-horizon task in more approachable sub-problems. In complement to that, 
we can also inspect and interpret the behavior of expert policies using Align- 
RUDDER’s alignment method. In our example, the expert policies are presented 
in the form of human demonstrations that successfully obtained a diamond. 
Align-RUDDER is able to extract a strategy from as few as ten trajectories. 
In the following, we outline the five steps of Align-RUDDER in the Minecraft 
environment. Furthermore, we inspect the alignment-based reward redistribu- 
tion and show how it enables interpretation of both the expert policies and the 
trained agent. 


(i) Defining Events. A state consists of a visual input and an inventory. 
Both inputs are normalized and then the difference of consecutive states is clus- 
tered, obtaining 19 clusters corresponding to events. Upon inspection these clus- 
ters correspond to inventory changes, i.e. gaining a particular item. Finally, the 
demonstration trajectories are mapped to sequences of events. This is shown in 
Fig. 9.


Fig. 9. Step (I): Define events and map demonstrations into sequences of events.


Fig. 10. Step (II): Construct a scoring matrix using event probabilities from demonstrations
for diagonal elements and setting off-diagonal to a constant value. Darker
colors signify higher score values. For illustration, only a subset of events is shown. 
(Color figure online) 


(ii) Determining the Alignment Scoring System. The scoring matrix is 
computed according to (II). Since there is no prior knowledge on how the individ- 
ual events are related to each other, the scoring matrix has the inverse frequency 
of an event occurring in the expert trajectories on the diagonal and a small con- 
stant value on the off-diagonal entries. As can be seen in Fig. 10, this results 
in lower scores for clusters corresponding to earlier events as they occur more 
often and high values for rare events such as building a pickaxe or mining the 
diamond. 
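
The construction of such a scoring matrix can be sketched as follows. This is a minimal illustration and not the implementation used for the experiments: the function name, the off-diagonal constant of -1.0, and the toy event sequences are assumptions; only the structure (inverse event frequency on the diagonal, one constant everywhere else) follows the description above.

```python
from collections import Counter


def build_scoring_matrix(sequences, off_diagonal=-1.0):
    """Build an event-by-event scoring matrix from demonstration sequences.

    Diagonal entries hold the inverse relative frequency of each event, so rare
    events score higher; every off-diagonal entry is the same constant.
    """
    counts = Counter(event for seq in sequences for event in seq)
    total = sum(counts.values())
    events = sorted(counts)
    matrix = {}
    for a in events:
        for b in events:
            if a == b:
                matrix[(a, b)] = total / counts[a]  # 1 / relative frequency
            else:
                matrix[(a, b)] = off_diagonal  # assumed constant
    return events, matrix


# Hypothetical toy sequences: S = log, P = plank, V = stick, N = wooden pickaxe.
demos = ["SSSPPVN", "SSPPVN", "SSSSPVN"]
events, scores = build_scoring_matrix(demos)
```

In this toy data the frequent event S receives a lower diagonal score than the rare event N, which mirrors the behavior described for early versus late events.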


(iii) Multiple Sequence Alignment (MSA). The 10 expert episodes that 
obtained a diamond in the shortest amount of time are aligned using ClustalW 
with zero gap penalties and no biological options (i.e. arguments to ClustalW 
related to biological sequences). The MSA algorithm maximizes the pairwise sum 
of scores of all alignments using the scoring matrix from (II). Figure 11 shows 
an example of such an alignment. 




[Figure panels: "Expert Demonstrations" and "Multiple Sequence Alignment".] 

Fig. 11. Step (III): Perform multiple sequence alignment (MSA) of the demonstra- 
tions. 


(iv) Position-Specific Scoring Matrix (PSSM) and MSA Profile 
Model. The multiple alignment gives a profile model and a PSSM. In Fig. 12 
an example of a PSSM is shown, resulting from an alignment of the previous 
example sequences. The PSSM contains for each position in the alignment the 
frequency of each event occurring in the trajectories used for the alignment. At 
this point, the strategy followed by the majority of experts is already visible. 

Fig. 12. Step (IV): Compute a position-specific scoring matrix (PSSM). The score at 
a position from the MSA (column) and for an event (row) depends on the frequency 
of that event at that position in the MSA. For example, the event in the last position 
is present in all the sequences, and thus gets a high score at the last position. But it is 
absent in the remaining positions, and thus gets a score of zero elsewhere. 
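
The frequency profile behind step (iv) can be sketched as follows. This is a simplified illustration under stated assumptions: the gap symbol "-", the function name, and the toy alignment are hypothetical, and the sketch stores raw per-position frequencies only, without any further score transformation that a full PSSM may apply.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol in the aligned sequences


def column_frequencies(alignment):
    """Return, for each alignment position, the frequency of every event.

    Frequencies are relative to the number of aligned sequences, so an event
    that is present in all sequences at a position gets frequency 1.0.
    """
    n_rows = len(alignment)
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for pos in range(length):
        counts = Counter(row[pos] for row in alignment if row[pos] != GAP)
        profile.append({event: c / n_rows for event, c in counts.items()})
    return profile


# Hypothetical alignment of three demonstrations.
toy_alignment = ["SSPLVN", "S-PLVN", "SSPL-N"]
profile = column_frequencies(toy_alignment)
```

Here the last position contains N in every sequence and therefore gets frequency 1.0 for N, while N does not appear at any other position.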


(v) Reward Redistribution. The reward is redistributed via differences of 
profile alignment scores of consecutive sub-sequences according to Eq. (3) using 
the PSSM. Figure 13 illustrates this on the example of an incomplete trajectory. 
In addition to aligning trajectories generated by an agent, we can use demon- 
strations from human players that were not able to obtain the diamond and 
therefore highlight problems those players have encountered. 

Fig. 13. Step (V): A new sequence is aligned step by step to the profile model using 
the PSSM, resulting in an alignment score for each sub-sequence. The redistributed 
reward is then proportional to the difference of scores of subsequent alignments. 
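
The idea of step (v) can be sketched as follows. This is an assumption-laden illustration rather than the method's actual Eq. (3): the alignment score is computed with a simple dynamic program with zero gap cost, the profile is a list of per-position score dictionaries, and the proportionality constant of the redistributed reward is omitted. All names are hypothetical.

```python
def alignment_score(prefix, profile):
    """Best score for aligning `prefix` to the profile, with free gaps.

    dp[j] is the best score using only the first j profile positions.
    """
    dp = [0.0] * (len(profile) + 1)
    for event in prefix:
        new = dp[:]  # leaving this event unaligned costs nothing
        for j in range(1, len(profile) + 1):
            match = dp[j - 1] + profile[j - 1].get(event, 0.0)
            new[j] = max(new[j], match, new[j - 1])
        dp = new
    return dp[-1]


def redistribute_reward(sequence, profile):
    """Reward at step t = score of prefix up to t minus score of prefix up to t-1."""
    rewards, previous = [], 0.0
    for t in range(1, len(sequence) + 1):
        score = alignment_score(sequence[:t], profile)
        rewards.append(score - previous)
        previous = score
    return rewards


# Hypothetical profile with one dominant event per position; D (dirt) is absent.
toy_profile = [{"S": 1.0}, {"P": 1.0}, {"L": 1.0}, {"N": 1.0}]
rewards = redistribute_reward("SDPL", toy_profile)
```

In this toy case the incidental event D receives zero reward, while each event that advances the alignment receives a positive reward; the rewards sum to the alignment score of the whole sequence.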


Interpreting Agent Behavior. The strategy for obtaining a diamond, an 
example of which is shown in Fig. 13, is a direct result of Align-RUDDER. If 
it is possible to map event clusters to a meaningful representation, as is the 
case here by mapping the clusters to changes in inventory states, the strategy 
describes the behavior of the expert policies in a very intuitive and interpretable 
fashion. Furthermore, new trajectories generated by the learned agent can be 
aligned to the strategy, highlighting differences or problems where the trained 
agent is unable to follow the expert strategy. Inspecting the strategy it can be 
seen that random events, such as collecting dirt which naturally occurs when 
digging, are not present as they are not important for solving the task. Surprisingly, 
items that seem helpful, such as torches for providing light when digging, are also 
not used by the majority of experts, even though they have to operate in near 
complete darkness without them. 


Results. Sub-agents can be trained for the sub-tasks extracted from the expert 
episodes. The sub-agents are first pre-trained on the expert episodes for the 
sub-tasks using BC, and further trained in the environment using Proximal Pol- 
icy Optimization (PPO) [57]. Using only 10 expert episodes, Align-RUDDER is 
able to learn to mine a diamond. A diamond is obtained in 0.1% of the cases, 
and to the best of our knowledge, no pure learning method² has yet mined a 
diamond [53]. With a 0.5 success probability for each of the 31 extracted sub-tasks³, 
the resulting success rate for mining the diamond would be $4.66 \times 10^{-10}$. 
Table 1 shows a comparison of methods on the Minecraft MineRL dataset by 
the maximum item score [37]. Results are taken from [37], in particular from 
Fig. 2, and completed by [33,53,61]. Align-RUDDER was not evaluated during 


2 This includes not only learning to extract the sub-tasks, but also learning to solve 
the sub-tasks themselves. 

3 A 0.5 success probability already defines a very skilled agent in the MineRL 
environment. 

Fig. 14. Comparing the consensus frequencies between behavioral cloning (BC, green), 
where fine-tuning starts, the fine-tuned model (orange), and human demonstrations 
(blue) [45]. The plot is in symmetric log scale (symlog in matplotlib). The mapping 
from the letters on the x-axis to items is as follows: S: log, P: plank, L: crafting table, 
V: stick, N: wooden pickaxe, A: cobblestone, Y: stone pickaxe, Q: iron ore, F: furnace, 
K: iron ingot, E: iron pickaxe, D: diamond ore. (Color figure online) 


the challenge, and may therefore have advantages. However, it did not receive the 
intermediate rewards provided by the environment that hint at sub-tasks, but 
self-discovered such sub-tasks, which demonstrates its efficient learning. Furthermore, 
Align-RUDDER is capable of extracting a common strategy from only a 
few demonstrations and training globally explainable models based on this strategy 
(Fig. 14). 
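
The success-rate estimate quoted in the Results paragraph can be checked with a short calculation. The sketch assumes, as the estimate itself implicitly does, that the 31 sub-tasks succeed independently with the same probability; the variable names are illustrative.

```python
n_subtasks = 31   # number of extracted sub-tasks
p_subtask = 0.5   # assumed success probability per sub-task

# Probability that all sub-tasks succeed, assuming independence.
p_all = p_subtask ** n_subtasks
print(f"{p_all:.2e}")  # prints 4.66e-10
```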


5 Limitations 


While Align-RUDDER can extract strategies and speed up learning even in 
complex environments, the resulting performance depends on the quality of the 
alignment model. A low quality alignment model can be a result of multiple fac- 
tors, one of which is having many distinct events (>20). Clustering can be used 
to reduce the number of events, which could also lead to a low quality alignment 
model if too many relevant events are clustered together. While the optimal pol- 
icy does not change due to a poor alignment of expert episodes, the benefit of 
employing reward redistribution based on such an alignment diminishes. 

The alignment could fail if all expert episodes have different underlying 
strategies, i.e. no events are common in the expert episodes. We assume that 
the expert episodes follow the same underlying strategy, therefore they are simi- 
lar to each other and can be aligned. However, if an underlying strategy does not 
exist, then the alignment may fail to identify relevant events that should receive 
high redistributed rewards. In this case, reward is given at sequence end, when 
the redistributed reward is corrected, which leads to an episodic reward without 

Table 1. Maximum item score of methods on the Minecraft task. Methods: Soft-Actor 
Critic (SAC, [20]), DQfD, Meta Learning Shared Hierarchies (MLSH, [17]), Rainbow 
[25], PPO, and BC. 

Method         Team Name 
Align-RUDDER   Ours 
DQfD           CDS 
BC             MC_RL 
CLEAR          ADS 
Options&PPO    CraftRL 
BC             UEFDRL 
SAC            TD240 
MLSH           LAIR 
Rainbow        Elytra 
PPO            karolisram 

[The item columns of the table are icons with graphical score markers.] 


reducing the delay of the rewards and speeding up learning. This is possible, as 
there can be many distinct paths to the same end state. This problem can be 
resolved if there are at least two demonstrations of each of these different strate- 
gies. This helps with identifying events for all different strategies, such that the 
alignment will not fail. 

Align-RUDDER has the potential to reduce the cost for training and deploy- 
ing agents in real world applications, and therefore enable systems that have 
not been possible until now. However, the method relies on expert episodes 
and thereby expert decisions, which are usually strongly biased. Therefore, the 
responsible use of Align-RUDDER depends on a careful selection of the training 
data and awareness of the potential biases within those. 


6 Conclusion 


We have analyzed Align-RUDDER, which solves highly complex tasks with 
delayed and sparse rewards. The global behavior of agents trained by Align- 
RUDDER can easily be explained by inspecting the alignment of events. Fur- 
thermore, the alignment step of Align-RUDDER can be employed to explain 
arbitrary agents’ behavior, so long as episodes generated with this agent are 
available or can be generated. 

Furthermore, we have shown that Align-RUDDER outperforms state-of-the- 
art methods designed for learning from demonstrations in the regime of few 
demonstrations. On the Minecraft ObtainDiamond task, Align-RUDDER is, to 
the best of our knowledge, the first pure learning method to mine a diamond. 


Abstract. Reinforcement learning is a promising strategy for automatically 
training policies for challenging control tasks. However, state-of-the-art 
deep reinforcement learning algorithms focus on training deep 
neural network (DNN) policies, which are black box models that are 
hard to interpret and reason about. In this chapter, we describe recent 
progress towards learning policies in the form of programs. Compared to 
DNNs, such programmatic policies are significantly more interpretable, 
easier to formally verify, and more robust. We give an overview of algorithms 
designed to learn programmatic policies, and describe several case 
studies demonstrating their various advantages. 


Keywords: Interpretable reinforcement learning · Program synthesis 


1 Introduction 


Reinforcement learning is a promising strategy for learning control policies for 
challenging sequential decision-making tasks. Recent work has demonstrated its 
promise in applications including game playing [34,43], robotics control [14,31], 
software systems [13,30], and healthcare [6,37]. A typical strategy is to build a 
high-fidelity simulator of the world, and then use reinforcement learning to train 
a control policy to act in this environment. This policy makes decisions (e.g., 
which direction to walk) based on the current state of the environment (e.g., 
the current image of the environment captured by a camera) to optimize the 
cumulative reward (e.g., how quickly the agent reaches its goal). 

There has been significant recent progress on developing powerful deep rein- 
forcement learning algorithms [33,41], which train a policy in the form of a deep 
neural network (DNN) by using gradient descent on the DNN parameters to opti- 
mize the cumulative reward. Importantly, these algorithms treat the underlying 
environment as a black box, making them very generally applicable. 
© The Author(s) 2022 

A key challenge in many real-world applications is the need to ensure that 
the learned policy continues to act correctly once it is deployed in the real world. 
However, DNN policies are typically very difficult to understand and analyze, 
making it hard to make guarantees about their performance. The reinforcement 
learning setting is particularly challenging since we need to reason not just about 
isolated predictions but about sequences of highly connected decisions. 

As a consequence, there has been a great deal of recent interest in learning 
policies in the form of programs, called programmatic policies. Such policies 
include existing interpretable models such as decision trees [9], which are simple 
programs composed of if-then-else statements, as well as more complex ones such 
as state machines [26] and list processing programs [27,50]. In general, programs 
have been leveraged in machine learning to achieve a wide range of goals, such 
as representing high-level structure in images [16,17,25,46,47,53] and classifying 
sequence data such as trajectories or text [12,42]. 

Programmatic policies have a number of advantages over DNN policies that 
make it easier to ensure they act correctly. For instance, programs tend to be 
significantly more interpretable than DNNs; as a consequence, human experts 
can often understand and debug behaviors of a programmatic policy [26,27,50]. 
In addition, in contrast to DNNs, programs have discrete structure, which makes 
them much more amenable to formal verification [3,9,39], which can be used to 
prove correctness properties of programmatic policies. Finally, there is evidence 
that programmatic policies are more robust than their DNN counterparts—e.g., 
they generalize better to changes in the task or robot configuration [26]. 

A key challenge with learning programmatic policies is that state-of-the-art 
reinforcement learning algorithms cannot be applied. In particular, these algo- 
rithms are based on the principle of gradient descent on the policy parameters, 
yet programmatic policies are typically non-differentiable (or at least, their opti- 
mization landscape contains many local minima). As a consequence, a common 
strategy for learning these policies is to first learn a DNN policy using deep 
reinforcement learning, and then use imitation learning to compress the DNN into 
a program. Essentially, this strategy reduces the reinforcement learning problem 
for programmatic policies into a supervised learning problem, for which efficient 
algorithms often exist—e.g., based on program synthesis [21]. A refinement of 
this strategy is to adaptively update the DNN policy to mirror the programmatic 
policy, which reduces the gap between the DNN and the program [26,49]. 

In this chapter, we provide an overview of recent progress in this direction. 
We begin by formalizing the reinforcement learning problem (Sect. 2); then, 
we describe interesting kinds of programmatic policies that have been stud- 
ied (Sect. 3), algorithms for learning programmatic policies (Sect. 4), and case 
studies demonstrating the value of programmatic policies (Sect. 5). 


2 Background on Reinforcement Learning 


We consider a reinforcement learning problem formulated as a Markov decision 
process (MDP) $M = (S, A, P, R)$ [86], where $S$ is the set of states, $A$ is the set 
of actions, $P(s' \mid s, a) \in [0, 1]$ is the probability of transitioning from state $s \in S$ 
to state $s' \in S$ upon taking action $a \in A$, and $R(s, a) \in \mathbb{R}$ is the reward accrued 
by taking action $a$ in state $s$. 

Given an MDP $M$, our goal is to train an agent that acts in $M$ in a way that 
accrues high cumulative reward. We represent the agent as a policy $\pi : S \to A$ 
mapping states to actions. Then, starting from a state $s \in S$, the agent selects 
action $a = \pi(s)$ according to the policy, observes a reward $R(s, a)$, transitions to 
the next state $s' \sim P(\cdot \mid s, a)$, and then iteratively continues this process starting 
from $s'$. For simplicity, we assume a deterministic initial state $s_1 \in S$ along 
with a fixed, finite number of steps $H \in \mathbb{N}$. Then, we formalize the trajectory 
taken by the agent as a rollout $\zeta \in (S \times A \times \mathbb{R})^H$, which is a sequence of 
state-action-reward tuples $\zeta = ((s_1, a_1, r_1), \dots, (s_H, a_H, r_H))$. We can sample a rollout 
by taking $a_t = \pi(s_t)$, $r_t = R(s_t, a_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ for each 
$t \in [H] = \{1, \dots, H\}$; we let $D^{(\pi)}(\zeta)$ denote the distribution over rollouts 
induced by using policy $\pi$. 

Now, our goal is to choose a policy $\pi \in \Pi$ in a given class of policies $\Pi$ that 
maximizes the expected reward accrued. In particular, letting $J(\zeta) = \sum_{t=1}^{H} r_t$ 
be the cumulative reward of rollout $\zeta$, our goal is to compute 

$$\pi^* = \arg\max_{\pi \in \Pi} J(\pi) \quad \text{where} \quad J(\pi) = \mathbb{E}_{\zeta \sim D^{(\pi)}}[J(\zeta)],$$ 

i.e., the policy $\pi \in \Pi$ that maximizes the expected cumulative reward over the 
induced distribution of rollouts $D^{(\pi)}(\zeta)$. 
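
As a rough illustration of these definitions, the objective $J(\pi)$ can be estimated by averaging the cumulative reward over sampled rollouts. The sketch below is not taken from the chapter: the function names and the toy chain MDP (integer positions, goal at 5, moves that fail 10% of the time) are assumptions chosen only to make the example runnable.

```python
import random


def sample_rollout(policy, transition, reward, s1, horizon, rng):
    """Sample one rollout ((s_1, a_1, r_1), ..., (s_H, a_H, r_H))."""
    rollout, s = [], s1
    for _ in range(horizon):
        a = policy(s)
        rollout.append((s, a, reward(s, a)))
        s = transition(s, a, rng)
    return rollout


def estimate_objective(policy, transition, reward, s1, horizon, n=200, seed=0):
    """Monte Carlo estimate of J(pi): mean cumulative reward over n rollouts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        rollout = sample_rollout(policy, transition, reward, s1, horizon, rng)
        total += sum(r for _, _, r in rollout)
    return total / n


# Hypothetical toy MDP: a chain of integer positions with the goal at 5.
def toy_transition(s, a, rng):
    return s + a if rng.random() < 0.9 else s  # the move fails 10% of the time


def toy_reward(s, a):
    return -abs(s - 5)


def towards_goal(s):
    if s < 5:
        return 1
    if s > 5:
        return -1
    return 0


def away_from_goal(s):
    return -1


j_good = estimate_objective(towards_goal, toy_transition, toy_reward, 0, 20)
j_bad = estimate_objective(away_from_goal, toy_transition, toy_reward, 0, 20)
```

Comparing `j_good` and `j_bad` corresponds to comparing two candidate policies in $\Pi$ by their estimated objective.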

As an example, we can model a robot navigating a room to reach a goal as 
follows. The state $(x, y) \in S = \mathbb{R}^2$ represents the robot's position, and the action 
$(v, \phi) \in A = \mathbb{R}^2$ represents the robot's velocity $v$ and direction $\phi$. The transition 
probabilities are $P(s' \mid s, a) = \mathcal{N}(f(s, a), \Sigma)$, where 

$$f((x, y), (v, \phi)) = (x + v \cdot \cos\phi \cdot \tau,\; y + v \cdot \sin\phi \cdot \tau),$$ 

where $\tau \in \mathbb{R}_{>0}$ is the time increment, and where $\Sigma \in \mathbb{R}^{2 \times 2}$ is the variance in 
the state transitions due to stochastic perturbations. Finally, the reward is 
the negative distance to the goal minus a penalty on the magnitude of the action, i.e., 
$R(s, a) = -\|s - g\|_2 - \lambda \cdot \|a\|_2$, where $g \in \mathbb{R}^2$ is the 
goal and $\lambda \in \mathbb{R}_{>0}$ is a hyperparameter. Intuitively, the optimal policy $\pi^*$ for this 
MDP takes actions in a way that maximizes the time the robot spends close to 
the goal $g$, while avoiding very large (and therefore costly) actions. 
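
The navigation example can be written down as a small simulation. The sketch below is illustrative only: the numeric constants, the isotropic noise model (covariance equal to SIGMA squared times the identity), and the hand-written policy are assumptions. The action term is subtracted in the reward so that large actions are penalized, in line with the stated intuition.

```python
import math
import random

TAU = 0.1          # time increment (assumed value)
SIGMA = 0.01       # per-coordinate noise std; assumes covariance SIGMA**2 * I
LAMBDA = 0.1       # action-cost weight (assumed value)
GOAL = (1.0, 1.0)  # goal position g (assumed value)


def f(state, action):
    """Noise-free next position after moving with speed v in direction phi."""
    x, y = state
    v, phi = action
    return (x + v * math.cos(phi) * TAU, y + v * math.sin(phi) * TAU)


def transition(state, action, rng):
    """Sample s' from a Gaussian centered at f(s, a)."""
    mean_x, mean_y = f(state, action)
    return (rng.gauss(mean_x, SIGMA), rng.gauss(mean_y, SIGMA))


def reward(state, action):
    """Negative distance to the goal minus a penalty on the action norm."""
    distance = math.hypot(state[0] - GOAL[0], state[1] - GOAL[1])
    effort = math.hypot(action[0], action[1])
    return -distance - LAMBDA * effort


def go_to_goal(state):
    """Hypothetical policy: head for the goal, speed capped at 1."""
    dx, dy = GOAL[0] - state[0], GOAL[1] - state[1]
    return (min(1.0, math.hypot(dx, dy)), math.atan2(dy, dx))


rng = random.Random(0)
state = (0.0, 0.0)
for _ in range(50):
    state = transition(state, go_to_goal(state), rng)
final_distance = math.hypot(state[0] - GOAL[0], state[1] - GOAL[1])
```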


3 Programmatic Policies 


The main difference in programmatic reinforcement learning compared to traditional 
reinforcement learning is the choice of policy class $\Pi$. In particular, we 
are interested in cases where $\Pi$ is a space of programs of some form. In this 
section, we describe specific choices that have been studied. 


3.1 Traditional Interpretable Models 


A natural starting point is learning policies in the form of traditional inter- 
pretable models, including decision trees [10] and rule lists [52]. In particular, 
these models can be thought of as simple programs composed of simple primitives 
such as if-then-else rules and arithmetic operations. For example, in Fig. 1, we 

[Decision tree diagram. Node tests: "Pole velocity > 0.29", "Cart velocity > -0.43", 
"Pole angle > 0.07"; branches labeled "no" and "yes"; leaves are actions such as "Right".] 

Fig. 1. A decision tree policy trained to control the cart-pole model; it achieves near- 
perfect performance. Adapted from [9]. 


show an example of a decision tree policy trained to control the cart-pole robot, 
which consists of a pole balanced on a cart; the goal is to move the cart back 
and forth to keep the pole upright [11]. Here, the state consists of the velocity 
and angle of each of the cart and the pole (i.e., $S \subseteq \mathbb{R}^4$), and the actions are to 
move the cart left or right (i.e., $A = \{\text{left}, \text{right}\}$). As we discuss in Sect. 5, these 
kinds of policies provide desirable properties such as interpretability, robustness, 
and verifiability. A key shortcoming is that they have difficulty handling more 
complex inputs, e.g., sets of other agents, sequences of observations, etc. Thus, 
we describe programs with more sophisticated components below. 
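
To make the "decision tree as a program" view concrete, the sketch below writes a small tree policy as nested if-then-else statements. It is a hypothetical example: the three thresholds are the node tests visible in Fig. 1, but the nesting of the tests, the leaf actions, and the ordering of the state variables are assumptions made only for illustration, so this is not the tree from the figure.

```python
def decision_tree_policy(state):
    """Map a cart-pole state to an action using if-then-else rules.

    Assumed state layout: (cart_position, cart_velocity, pole_angle,
    pole_velocity). The tree shape and leaf actions are hypothetical.
    """
    cart_position, cart_velocity, pole_angle, pole_velocity = state
    if pole_velocity > 0.29:
        return "right"
    if cart_velocity > -0.43:
        if pole_angle > 0.07:
            return "right"
        return "left"
    return "left"
```

Every decision of such a policy can be traced to at most three comparisons, which is what makes it easy to inspect.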


(a) Example of a rollout    (b) State machine policy 

Fig. 2. (a) A depiction of the task, which is to drive the blue car (the agent) out from 
between the two stationary black cars. (b) A state machine policy trained to solve this 
task. Adapted from [26]. (Color figure online) 


3.2 State Machine Policies 


A key shortcoming of traditional interpretable models is that they do not possess 
internal state—i.e., the policy cannot propagate information about the current 
time step to the next time step. In principle, for an MDP, keeping internal state 
is not necessary since the state variable contains all information necessary to 
act optimally. Nevertheless, in many cases, it can be helpful for the policy to 
keep internal state—for instance, for motions such as walking or swimming that 
repeat iteratively, it can be helpful to internally keep track of progress within the 
current iteration. In addition, if the state is partially observed (i.e., the policy 
only has access to o = h(s) instead of the full state s), then internal state may 
be necessary to act optimally [28]. In the context of deep reinforcement learning, 
recurrent neural networks (RNNs) can be used to include internal state [23]. 

For programmatic policies, a natural analog is to use policies based on finite-state 
machines. In particular, state machine policies are designed to be interpretable 
while including internal state [26]. The internal state of such a policy records one of a 
finite set of possible modes, each of which is annotated with (i) a simple policy 
for choosing the action when in this mode (e.g., a linear function of the 
state), and (ii) rules for when to transition to the next mode (e.g., if some linear 
inequality becomes satisfied, then transition to a given next mode). These 
policies are closely related to hybrid automata [2,24], which are models of a subclass 
of dynamical systems called hybrid systems that include both continuous 
transitions (modeled by differential equations) and discrete, discontinuous ones 
(modeled by a finite-state machine). In particular, the closed-loop system consisting 
of a state-machine policy controlling a hybrid system is also a hybrid 
system. 
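
The structure described above (modes, a per-mode action rule, and transition rules) can be sketched as a small data structure. Everything concrete in the sketch is an assumption: the class and field names, the observation keys `d_f`, `d_b`, and `clear`, the thresholds, and the constant per-mode actions. The three-mode back-and-forth maneuver is loosely inspired by the task in Fig. 2 and is not the synthesized policy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Observation = Dict[str, float]
Action = Tuple[float, float]  # (velocity, steering angle)


@dataclass
class Mode:
    action: Callable[[Observation], Action]
    # Ordered (condition, next_mode) pairs; the first satisfied one fires.
    transitions: List[Tuple[Callable[[Observation], bool], str]]


class StateMachinePolicy:
    def __init__(self, modes: Dict[str, Mode], start: str):
        self.modes = modes
        self.mode = start  # internal state carried across time steps

    def act(self, obs: Observation) -> Action:
        for condition, target in self.modes[self.mode].transitions:
            if condition(obs):
                self.mode = target
                break
        return self.modes[self.mode].action(obs)


def make_maneuver_policy() -> StateMachinePolicy:
    modes = {
        # m1: forward and left, until close to the front or the exit is clear.
        "m1": Mode(lambda o: (0.5, 0.5),
                   [(lambda o: o["clear"] > 0.5, "m3"),
                    (lambda o: o["d_f"] < 0.3, "m2")]),
        # m2: backward and right, until close to the back.
        "m2": Mode(lambda o: (-0.5, -0.5),
                   [(lambda o: o["d_b"] < 0.3, "m1")]),
        # m3: forward and right; no outgoing transitions.
        "m3": Mode(lambda o: (0.5, -0.5), []),
    }
    return StateMachinePolicy(modes, start="m1")
```

A policy object keeps its current mode between calls to `act`, which is exactly the internal state that a decision tree lacks.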

As an example, consider Fig. 2; the blue car (the agent) is parked between 
two stationary black cars, and its goal is to drive out of its parking spot into the 
goal position while avoiding collisions. The state is $(x, y, \theta, d) \in \mathbb{R}^4$, where $(x, y)$ 
is the center of the car, $\theta$ is its orientation, and $d$ is the distance between the 
two black cars. The actions are $(v, \psi) \in \mathbb{R}^2$, where $v$ is the velocity and $\psi$ is the 
steering angle. The transitions are the standard bicycle dynamics [35]. 

In Fig. 2b, we show the state machine policy synthesized by our algorithm 
for this task. We use $d_f$ and $d_b$ to denote the distances between the agent and 
the front and back black cars, respectively. This policy has three different modes 
(besides a start mode $m_s$ and an end mode $m_e$). Roughly speaking, it says (i) 
immediately shift from mode $m_s$ to $m_1$, and drive the car forward and to the 
left, (ii) continue until close to the car in front; then, transition to mode $m_2$, 
and drive the car backwards and to the right, (iii) continue until close to the car 
behind; then, transition back to mode $m_1$, (iv) iterate between $m_1$ and $m_2$ until 
the car can safely exit the parking spot; then, transition to mode $m_3$, and drive 
forward and to the right to make the car parallel to the lane. 

Fig. 3. Two groups of agents (red vs. blue) at their initial positions (circles) trying 
to reach their goal positions (crosses). The solid line shows the trajectory taken by a 
single agent in each group. (Color figure online) 

(a) DNN attention    (b) Program attention 

$R_1$ : argmax(map($-d^{i,j}$, filter($\theta^{i,j} \geq -1.85$, $\ell$))), 
$R_2$ : random(filter($d^{i,j} \geq 3.41$, $\ell$)). 

(c) Programmatic attention rules 

Fig. 4. (a) Soft attention computed by a DNN for the agent along the y-axis deciding 
whether to focus on the agent along the x-axis. (b) Sparse attention computed by a 
program. (c) Program used by each agent to select other agents to focus on. Adapted 
from [27]. 


3.3 List Processing Programs 


Another kind of programmatic policy is list processing programs, which are 
compositions of components designed to manipulate lists—e.g., the map, filter, 


Reinforcement Learning via Program Synthesis 213 


and fold operators [18]; the set of possible components can be chosen based on 
the application. In contrast to state machine policies, list processing programs 
are designed to handle situations where the state includes lists of elements. For 
example, in multi-agent systems, the full state consists of a list of states for each 
individual agent [27]. In this case, the program must compute a single action 
based on the given list of states. Alternatively, for environments with variable 
numbers of objects, the set of object positions must be encoded as a list. Finally, 
they can also be used to choose actions based on the history of the previous k 
states [50], which achieves a similar goal as state machine policies. 

As an example, consider the task in Fig. 3, where agents in group 1 (blue) 
are navigating from the left to their goal on the right, while agents in group 2 
(red) are navigating from the right to their goal on the left. The system state 
$s \in \mathbb{R}^{2k}$ is a list containing the position $(x_i, y_i)$ of each of the $k$ agents. An 
action $a \in \mathbb{R}^{2k}$ consists of the velocities $(v_i, w_i)$ to be applied by each agent. We 
consider a strategy where we use a single policy $\pi : S \times [k] \to \mathbb{R}^2$, which takes 
as input the system state along with the index of the current agent $i \in [k]$, and 
produces the action $\pi(s, i)$ to be taken by agent $i$. This policy is applied to each 
agent to construct the full list of actions. 

To solve this task, each agent must determine which agents to focus on; in 
the example in Fig. 3, it is useful to attend to the closest neighbor in the same 
group (to avoid colliding with them), as well as to an arbitrary agent from the 
opposite group (to coordinate so their trajectories do not collide). 

For now, we describe programmatic policies for each agent designed to select 
a small number of other agents to focus on. This list of agents can in principle be 
processed by a second programmatic policy to determine the action to choose; 
however, in Sect. 3.4, we describe a strategy that combines them with a neural 
network policy to select actions. Figure 4c shows an example of a programmatic 
policy that each agent can use to choose other agents to focus on for the task in 
Fig. 3. This program consists of two rules, each of which selects a single agent 
to focus on; the program returns the set consisting of both selected agents. In 
each of these rules, agent $i$ is selecting over other agents $j$ in the list $\ell$; $d^{i,j}$ is 
the distance between them and $\theta^{i,j}$ is the angle between them. Intuitively, rule 
$R_1$ chooses the nearest other agent $j$ such that $\theta^{i,j} \in [-1.85, \pi]$, which is likely 
an agent in the same group as agent $i$ that is directly in front of agent $i$; thus, 
agent $i$ needs to focus on it to avoid colliding into it. In contrast, $R_2$ chooses a 
random agent from the agents that are far away, which is likely an agent in the 
other group; thus, agent $i$ can use this information to avoid the other group. 
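
The two rules can be mirrored almost literally with list comprehensions. The sketch below is an illustration under assumptions: each other agent is represented as a dictionary with hypothetical keys `j`, `d` (distance to agent i), and `theta` (angle relative to agent i); the thresholds -1.85 and 3.41 are the ones in the rules, the comparisons are taken to be non-strict, and returning nothing for an empty filter result is a choice made here.

```python
import random


def rule_r1(agents):
    """Nearest agent among those with theta >= -1.85 (argmax of -d)."""
    candidates = [a for a in agents if a["theta"] >= -1.85]
    if not candidates:
        return None
    return max(candidates, key=lambda a: -a["d"])


def rule_r2(agents, rng):
    """A random agent among those at distance d >= 3.41."""
    candidates = [a for a in agents if a["d"] >= 3.41]
    return rng.choice(candidates) if candidates else None


def attention_program(agents, rng):
    """Return the agents selected by the two rules, without duplicates."""
    selected = []
    for choice in (rule_r1(agents), rule_r2(agents, rng)):
        if choice is not None and choice not in selected:
            selected.append(choice)
    return selected


# Hypothetical neighbors of agent i.
others = [
    {"j": 1, "d": 0.5, "theta": 0.1},
    {"j": 2, "d": 0.3, "theta": -2.5},  # closest, but outside the angle range
    {"j": 3, "d": 4.0, "theta": 1.0},   # far away
]
focus = attention_program(others, random.Random(0))
```

In this toy input the program attends to agents 1 and 3 only, which illustrates the sparse attention pattern discussed for this kind of program.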


3.4 Neurosymbolic Policies 


In some settings, we want part of the policy to be programmatic, but other 
parts of the policy to be DNNs. We refer to policies that combine programs and 
DNNs as neurosymbolic policies. Intuitively, the program handles part of the 
computation that we would like to be interpretable, whereas the DNN handles 
the remainder of the computation (potentially the part that cannot be easily 
approximated by an interpretable model). 

One instance of this strategy is to leverage programs as the attention mechanism 
for a transformer model [27]. At a high level, a transformer [48] is a DNN 
that operates on a list of inputs. These models operate by first choosing a small 
subset of other elements of the list to focus on (the attention layer), then using a 
fully-connected layer to decide what information from the other agents is useful 
(the value layer), and finally using a second fully-connected layer to compute the 
result (the output layer). For example, transformers can be applied to multi-agent 
systems, since each agent has to reason over the list of other agents. 

A neurosymbolic transformer is similar to a transformer but uses program- 
matic policies for the attention layer; the value layer and the output layer are still 
neural networks. This architecture makes the attention layer interpretable—e.g., 
it is easy to understand and visualize why an agent attends to another agent, 
while still retaining much of the complexity of the original transformer. 

For example, the program shown in Fig. 4c can be used to select other agents 
to attend to in a neurosymbolic transformer; unlike a DNN attention layer, 
this program is interpretable. An added advantage is that the program pro- 
duces sparse attention weights; in contrast, a DNN attention layer produces soft 
attention weights, so every agent needs to attend to every other agent, even if 
the attention weight is small. Figure 4a shows the soft attention computed by a 
DNN, and Fig. 4b shows the sparse attention computed by a program. 


4 Synthesizing Programmatic Policies 


Next, we describe our algorithms for training programmatic policies. We begin 
by describing the general strategy of first training a deep neural network (DNN) 
policy using deep reinforcement learning, and then using imitation learning in 
conjunction with the DNN policy to reduce the reinforcement learning problem 
for programmatic policies to a supervised learning problem (Sect. 4.1 and 4.2). 
Then, we describe a refinement of this strategy where the DNN is adaptively 
updated to better mirror the current programmatic policy (Sect. 4.3). Finally, 
all of these strategies rely on a subroutine for solving the supervised learning 
problem; we briefly discuss approaches to doing so (Sect. 4.4). 


4.1 Imitation Learning 


We focus on the setting of continuous state and action spaces (i.e., $S \subseteq \mathbb{R}^n$ 
and $A \subseteq \mathbb{R}^m$), but our techniques are applicable more broadly. A number of 
algorithms have been proposed for computing optimal policies for a given MDP 
$M$ and policy class $\Pi$ [44]. For continuous state and action spaces, state-of-the-art 
deep reinforcement learning algorithms [33,41] consider a parametric policy 
class $\Pi = \{\pi_\theta \mid \theta \in \Theta\}$, where the parameters $\Theta \subseteq \mathbb{R}^d$ are real-valued (e.g., $\pi_\theta$ 
is a DNN and $\theta$ are its parameters). Then, they compute $\pi^*$ by optimizing over 
$\theta$. One strategy is to use gradient descent on the objective, i.e., 

$$\theta \leftarrow \theta + \eta \cdot \nabla_\theta J(\pi_\theta),$$ 

with step size $\eta$. 

Algorithm 1. Training programmatic policies using imitation learning. 
procedure IMITATIONLEARN($M$, $Q^*$, $m$, $n$) 
    Train oracle policy $\pi^* \leftarrow$ TrainDNN($M$) 
    Initialize training dataset $Z \leftarrow \emptyset$ 
    Initialize programmatic policy $\hat{\pi}_0 \leftarrow \pi^*$ 
    for $i \in \{1, \dots, n\}$ do 
        Sample $m$ trajectories to construct $Z_i \leftarrow \{(s, \pi^*(s)) \sim D^{(\hat{\pi}_{i-1})}\}$ 
        Aggregate dataset $Z \leftarrow Z \cup Z_i$ 
        Train programmatic policy $\hat{\pi}_i \leftarrow$ TrainProgram($Z$) 
    end for 
    return Best policy $\hat{\pi} \in \{\hat{\pi}_1, \dots, \hat{\pi}_n\}$ on cross validation 
end procedure 


In particular, the policy gradient theorem [45] encodes how to compute an unbiased 
estimator of the gradient of this objective in terms of $\nabla_\theta \pi_\theta$. In general, most 
state-of-the-art approaches rely on gradient descent on the policy parameters $\theta$. However, 
such approaches cannot be applied to training programmatic policies, since the 
search space of programs is typically discrete. 

Instead, a general strategy is to use imitation learning to reduce the reinforcement 
learning problem to a supervised learning problem. At a high level, 
the idea is to first use deep reinforcement learning to learn a high-performing 
DNN policy $\pi^*$, and then train the programmatic policy $\hat{\pi}$ to imitate $\pi^*$. 

A naive strategy is to use an imitation learning algorithm called behavioral 
cloning [4], which uses $\pi^*$ to explore the MDP, collects the state-action pairs 
$Z = \{(s, a)\}$ occurring in rollouts $\zeta \sim D^{(\pi^*)}$, and then trains $\hat{\pi}$ using supervised 
learning on the dataset $Z$, i.e., 

$$\hat{\pi} = \arg\min_{\pi \in \Pi} \sum_{(s, a) \in Z} \mathbb{1}(\pi(s) \neq a). \tag{1}$$ 

Intuitively, the key shortcoming with this approach is that if $\hat{\pi}$ makes a mistake 
compared to the DNN policy $\pi^*$, then it might reach a state $s$ that is very 
different from the states in the dataset $Z$. Thus, $\hat{\pi}$ may not know the correct 
action to take in state $s$, leading to poor performance. As a simple example, 
consider a self-driving car, and suppose $\pi^*$ drives perfectly in the center of the lane, 
whereas $\hat{\pi}$ deviates slightly from the center early in the rollout. Then, it reaches 
a state never seen in the training data $Z$, which means $\hat{\pi}$ does not know how to 
act in this state, so it may deviate further. 

State-of-the-art imitation learning algorithms are designed to avoid these
issues. One simple but effective strategy is the Dataset Aggregation (DAGGER)
algorithm [38], which iteratively retrains the programmatic policy based on the
distribution of states it visits. The first iteration is the same as behavioral
cloning; in particular, it generates an initial dataset $Z_0$ using $\pi^*$ and
trains an initial programmatic policy $\hat{\pi}_0$. In each subsequent
iteration $i$, it generates a dataset $Z_i$ by visiting states with the
previous programmatic policy $\hat{\pi}_{i-1}$ and labeling them with the
actions of $\pi^*$, adds $Z_i$ to the aggregated dataset $Z$, and then trains
$\hat{\pi}_i$ on $Z$. This strategy is summarized in Algorithm 1; it has been
successfully leveraged to train programmatic policies to solve reinforcement
learning problems [50].
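
The loop in Algorithm 1 can be illustrated with a minimal sketch on a toy
problem. Everything concrete here is an assumption made for illustration and
is not taken from the chapter: the one-dimensional "lane" state, the noisy
dynamics, the hand-written `oracle` standing in for the DNN policy $\pi^*$,
the threshold rules standing in for the program space, and the held-out
agreement score standing in for cross validation.

```python
import random

def oracle(s):
    """Stand-in for the DNN policy pi*: steer toward the lane center at 0."""
    return +1 if s < 0 else -1

def rollout(policy, rng, horizon=20):
    """Return the states visited by `policy` under noisy toy dynamics."""
    s, states = rng.randint(-3, 3), []
    for _ in range(horizon):
        states.append(s)
        s = max(-10, min(10, s + policy(s) + rng.choice((-1, 0, 1))))
    return states

def make_rule(c):
    """Programmatic policy: if s < c then +1 else -1."""
    return lambda s: +1 if s < c else -1

def train_program(dataset):
    """TrainProgram(Z): enumerate thresholds, keep the one with fewest errors."""
    def errors(c):
        rule = make_rule(c)
        return sum(rule(s) != a for s, a in dataset)
    return min(range(-10, 11), key=errors)

def dagger(n_iters=5, m=10, seed=0):
    rng = random.Random(seed)
    dataset, thresholds = [], []
    policy = oracle  # the first iteration rolls out pi*, as in behavioral cloning
    for _ in range(n_iters):
        for _ in range(m):
            # states come from the current policy; labels always come from pi*
            dataset += [(s, oracle(s)) for s in rollout(policy, rng)]
        c = train_program(dataset)  # train on the aggregated dataset
        thresholds.append(c)
        policy = make_rule(c)
    # stand-in for cross validation: fewest disagreements on held-out states
    held_out = [s for _ in range(m) for s in rollout(oracle, rng)]
    return min(thresholds,
               key=lambda c: sum(make_rule(c)(s) != oracle(s) for s in held_out))
```

Calling `dagger()` returns the threshold of the selected rule. The point of
the sketch is the data flow: the states are gathered under the learner's own
policy, while the labels always come from $\pi^*$.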


4.2 Q-Guided Imitation Learning 


One shortcoming of Algorithm 1 is that it does not account for the fact that
certain actions are more important than others [9]. Instead, the loss function
in Eq. 1 treats all state-action pairs in the dataset $Z$ as being equally
important. However, in practice, one state-action pair $(s, a)$ may be
significantly more consequential than another one $(s', a')$, i.e., making a
mistake $\hat{\pi}(s) \neq a$ might degrade performance by significantly more
than a mistake $\hat{\pi}(s') \neq a'$.

For example, consider the toy game of Pong in Fig. 5; the goal is to move the
paddle to prevent the ball from exiting the screen. Figure 5a shows a state where 
the action taken is very important; the paddle must be moved to the right, or 
else the ball cannot be stopped from exiting. In contrast, Fig. 5b shows a state 
where the action taken is unimportant. Ideally, our algorithm would upweight 
the former state-action pair and downweight the latter. 

One way to address this issue is by leveraging the Q-function, which measures
the quality of a state-action pair; in particular, $Q^{(\pi)}(s, a) \in
\mathbb{R}$ is the cumulative reward accrued by taking action $a$ in state $s$,
and then continuing with policy $\pi$. Traditional imitation learning
algorithms do not have access to $Q^{(\pi^*)}$, since $\pi^*$ is typically a
human expert, and it would be difficult to elicit these values. However, the
Q-function $Q^{(\pi^*)}$ for the DNN policy $\pi^*$ is computed as a byproduct
of many deep reinforcement learning algorithms, so it is typically available in
our setting. Given $Q^{(\pi^*)}$, a natural alternative to Eq. 1 is

$$\hat{\pi} = \arg\min_{\pi \in \Pi} \sum_{(s,a) \in Z} \left( Q^{(\pi^*)}(s, a) - Q^{(\pi^*)}(s, \pi(s)) \right). \qquad (2)$$

Intuitively, the term $Q^{(\pi^*)}(s, a) - Q^{(\pi^*)}(s, \hat{\pi}(s))$
measures the degradation in performance by taking the incorrect action
$\hat{\pi}(s)$ instead of $a$. Indeed, it can be proven that this objective
exactly encodes the gap in performance between $\hat{\pi}$ and $\pi^*$, i.e.,
in the limit of infinite data, it is equivalent to computing

$$\hat{\pi} = \arg\min_{\pi \in \Pi} \{ J(\pi^*) - J(\pi) \}.$$


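Finally, a shortcoming of Eq. 2 is that it is not a standard supervised
learning problem. To address this issue, we can instead optimize the upper
bound

$$Q^{(\pi^*)}(s, a) - Q^{(\pi^*)}(s, \pi(s)) \le \left( Q^{(\pi^*)}(s, a) - \min_{a' \in A} Q^{(\pi^*)}(s, a') \right) \cdot \mathbb{1}(\pi(s) \neq a),$$

which yields the optimization problem

$$\hat{\pi} = \arg\min_{\pi \in \Pi} \sum_{(s,a) \in Z} \left( Q^{(\pi^*)}(s, a) - \min_{a' \in A} Q^{(\pi^*)}(s, a') \right) \cdot \mathbb{1}(\pi(s) \neq a).$$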


This strategy is proposed in [9] and shown to learn significantly more compact 
policies compared to the original DAGGER algorithm. 
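
The weighted objective above is an ordinary classification loss with one
weight per example, which the following sketch makes concrete. The Q-values in
`Q_TABLE`, the state names, and the convention that the label $a$ is the
action with the highest Q-value are all assumptions chosen for illustration;
they are not taken from [9].

```python
def q_weighted_dataset(states, q, actions):
    """Label each state with a = argmax_a Q(s, a) and attach the weight
    Q(s, a) - min_a' Q(s, a')."""
    data = []
    for s in states:
        values = {a: q(s, a) for a in actions}
        a_best = max(values, key=values.get)
        data.append((s, a_best, values[a_best] - min(values.values())))
    return data

def weighted_error(policy, data):
    """Objective: sum of weights over states where the policy disagrees with a."""
    return sum(w for s, a, w in data if policy(s) != a)

# Hypothetical Q-values for two Pong-like states (not taken from the source).
Q_TABLE = {
    ("critical", "left"): 0, ("critical", "right"): 10,
    ("idle", "left"): 9, ("idle", "right"): 10,
}
data = q_weighted_dataset(["critical", "idle"],
                          lambda s, a: Q_TABLE[(s, a)],
                          ("left", "right"))
```

In this toy table a mistake in the critical state costs ten times as much as a
mistake in the idle state, so a learner with limited capacity is pushed to get
the critical state right first.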




4.3 Updating the DNN Policy 


Another shortcoming of Algorithm 1 is that it does not adjust the DNN policy
$\pi^*$ to account for limitations on the capabilities of the programmatic
policy $\hat{\pi}$. Intuitively, if $\hat{\pi}$ cannot accurately approximate
$\pi^*$, then $\pi^*$ may suggest actions that lead to states where
$\hat{\pi}$ cannot perform well, even if $\pi^*$ performs well in these
states. There has been work on addressing this issue. For example, coaching can
be used to select actions that are more suitable for $\hat{\pi}$ [22].
Alternatively, $\pi^*$ can be iteratively updated using gradient descent to
better reflect $\hat{\pi}$ [49].

A related strategy is adaptive teaching, where rather than choosing $\pi^*$ to
be a DNN, it is instead a policy whose structure mirrors that of $\hat{\pi}$
[26]. In this case, we can directly update $\pi^*$ on each training iteration
to reflect the structure of $\hat{\pi}$. As an example, in the case of state
machine policies, $\pi^*$ can be chosen to be a “loop-free” policy, which
consists of a linear sequence of modes. These modes can then be mapped to the
modes of $\hat{\pi}$, and regularized so that their local policies and mode
transitions mirror those of $\hat{\pi}$. Adaptive teaching has been shown to be
an effective strategy for learning state machine policies [26].




(a) Critical state (b) Non-critical state 


Fig. 5. A toy game of Pong. The paddle is the gray bar at the bottom, and the ball 
is the gray square. The red arrow shows the direction the ball is traveling, and the 
blue arrows show the possible actions (move paddle left vs. right). We show examples 
where the action taken in this state (a) does, and (b) does not significantly impact the 
cumulative reward accrued from this state. (Color figure online) 


4.4 Program Synthesis for Supervised Learning 


Recall that imitation learning reduces the reinforcement learning problem for 
programmatic policies to a supervised learning problem. We briefly discuss algo- 
rithms for solving this supervised learning problem. In general, this problem is an 
instance of programming by example [19,20], which is a special case of program 
synthesis [21] where the task is specified by a set of input-output examples. In 
our setting, the input-output examples are the state-action pairs in the dataset 
Z used to train the programmatic policy at each iteration of Algorithm 1. 
An added challenge in applying program synthesis in machine learning settings
is that traditional programming by example algorithms are designed to compute 
a program that correctly fits all of the training examples. In contrast, in machine 
learning, there typically does not exist a single program that fits all of the train- 
ing examples. Instead, we need to solve a quantitative synthesis problem where 
the goal is to minimize the number of errors on the training data. 

One standard approach to solving such program synthesis problems is to
simply enumerate over all possible programmatic policies $\pi \in \Pi$. In many
cases, $\Pi$ is specified as a context-free grammar, in which case standard
algorithms can be used to enumerate programs in that grammar (typically up to a
bounded depth) [5]. In addition, domain-specific techniques can be used to
prune provably suboptimal portions of the search space to speed up enumeration
[12]. For particularly large search spaces, an alternative strategy is to use a
stochastic search algorithm that heuristically optimizes the objective; for
example, Metropolis-Hastings can be used to adaptively sample programs (e.g.,
with the unnormalized probability density function taken to be the objective
value) [27,40].
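
As a rough illustration of bounded-depth enumeration for quantitative
synthesis, the sketch below enumerates every program of a tiny decision-tree
grammar and keeps the one with the fewest training errors. The grammar, the
feature indices, the thresholds, and the action names are hypothetical choices
made for this example; no pruning is applied, so it only scales to very small
depths.

```python
from itertools import product

ACTIONS = ("left", "right")
FEATURES = (0, 1)          # indices into the state vector
THRESHOLDS = (0.0, 0.5)

def enumerate_programs(depth):
    """Yield all programs P ::= act a | if x[i] < c then P else P up to `depth`."""
    for a in ACTIONS:
        yield ("act", a)
    if depth > 0:
        subprograms = list(enumerate_programs(depth - 1))
        for i, c in product(FEATURES, THRESHOLDS):
            for then_p, else_p in product(subprograms, subprograms):
                yield ("if", i, c, then_p, else_p)

def run(program, x):
    """Evaluate a program on state x and return the chosen action."""
    while program[0] == "if":
        _, i, c, then_p, else_p = program
        program = then_p if x[i] < c else else_p
    return program[1]

def num_errors(program, examples):
    return sum(run(program, x) != a for x, a in examples)

def synthesize(examples, depth=1):
    """Quantitative synthesis: minimize the number of misclassified examples."""
    return min(enumerate_programs(depth), key=lambda p: num_errors(p, examples))

# State-action pairs; the last one is deliberately inconsistent with the rest.
examples = [((-1.0, 0.0), "right"), ((-0.5, 0.3), "right"),
            ((0.2, 0.1), "left"), ((0.7, 0.9), "left"), ((0.9, 0.2), "right")]
best = synthesize(examples, depth=1)
```

No depth-1 program fits all five examples, so the search returns a program
that makes a single error, which is the situation the quantitative
formulation is designed for.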


(a) Original (b) Change (i) (c) Change (ii)


Fig. 6. A human expert can modify our state machine policy to improve performance. 
(a) A trajectory using the original state machine policy shown in Fig. 2(b). (b) The 
human expert sets the steering angle to the maximum value 0.5. (c) The human expert 
sets the thresholds in the mode transitions so the blue car drives as close to the black 
cars as possible. Adapted from [26]. (Color figure online) 


5 Case Studies 


In this section, we describe a number of case studies that demonstrate the value
of programmatic policies, in particular their interpretability (Sect. 5.1),
verifiability (Sect. 5.2), and robustness (Sect. 5.3).


Fig. 7. Visualization of the programmatic attention layer in Fig. 4c, which has
two rules $R_1$ and $R_2$. In this task, there are three groups of agents. The
red circle denotes the agent currently choosing an action, the red cross
denotes its goal, and the green circle denotes the agent selected by the rule.
(a, b) Visualization of rule $R_1$ for two different states; orange denotes the
region where the filter condition is satisfied, i.e., $R_1$ chooses a random
agent in this region. (c) Visualization of rule $R_2$, showing the score output
by the map operator; darker values are higher, i.e., the rule chooses the agent
with the darkest value. Adapted from [27]. (Color figure online)


5.1 Interpretability 


A key advantage of programmatic policies is that they are interpretable [26, 27,
50]. One consequence of their interpretability is that human experts can examine
programmatic policies and modify them to improve performance. As an example,
consider the state machine policy shown in Fig. 2b in Sect. 3. We have manually
made the following changes to this policy: (i) increase the steering angle in
mode $m_1$ to its maximum value 0.5 (so the car steers as much as possible when
exiting the parking spot), and (ii) decrease the gap maintained between the
agent and the black cars by changing the condition for transitioning from mode
$m_1$ to mode $m_2$ to $d_f < 0.1$, and from mode $m_2$ to mode $m_1$ to
$d_b < 0.1$ (so the blue car drives as far as possible without colliding with a
black car before changing directions). Figure 6 visualizes the effects of these
changes; in particular, it shows trajectories obtained using the original
policy, the policy with change (i), and the policy with change (ii). As can be
seen, the second modified policy exits the parking spot more quickly than the
original policy. There is no straightforward way to make these kinds of changes
to improve a DNN policy.

Similarly, we describe how it is possible to interpret programmatic attention
layers in neurosymbolic transformers. In particular, Fig. 7 visualizes the
synthesized programmatic attention policy described in Sect. 3 for a
multi-agent control problem; in this example, there are three groups of agents,
each trying to move towards their goals. Figures 7a and 7b visualize rule $R_1$
in two different states. In particular, $R_1$ selects a random far-away agent
in the orange region to focus on. Note that in both states, the orange region
is in the direction of the goal of the agent. Intuitively, the agent is
focusing on an agent in the other group that is between itself and the goal;
this choice enables the agent to plan a path to its goal that avoids colliding
with the other group. Next, Fig. 7c visualizes rule $R_2$; this rule simply
focuses on a nearby agent, which enables the agent to avoid collisions with
other agents in the same group.


5.2 Verification 


Another key advantage of programmatic policies is that they are significantly 
easier to formally verify. Intuitively, because they make significant use of discrete 
control flow structures, it is easier for formal methods to prune branches of the 
search space corresponding to unreachable program paths. 

Verification is useful when there is an additional safety constraint that must
be satisfied by the policy in addition to maximizing cumulative reward. A
common assumption is that the agent should remain in a safe subset of the state
space $S_{\text{safe}} \subseteq S$ during the entire rollout. Furthermore, in
these settings, it is often assumed that the transitions are deterministic,
i.e., the next state is $s' = f(s, a)$ for some deterministic transition
function $f : S \times A \to S$. Finally, rather than considering a single
initial state, we instead consider a subset of initial states
$S_1 \subseteq S_{\text{safe}}$. Then, we consider the safety constraint that
for any rollout $\zeta$ starting from $s_1 \in S_1$, we have
$s_t \in S_{\text{safe}}$ for all $t \in [H]$; we use
$\phi(\pi) \in \{\text{true}, \text{false}\}$ to indicate whether a given
policy $\pi$ satisfies this constraint. Our goal is to solve

$$\pi^* = \arg\max_{\pi \in \Pi_{\text{safe}}} J(\pi) \quad \text{where} \quad \Pi_{\text{safe}} = \{ \pi \in \Pi \mid \phi(\pi) \}.$$


A standard strategy for verifying safety is to devise a logical formula that
encodes a safe rollout; in particular, we can encode our safety constraint as
follows:

$$\phi(\pi) \equiv \forall \vec{s} \,.\, \left[ (s_1 \in S_1) \wedge \bigwedge_{t=1}^{H} \left( a_t = \pi(s_t) \wedge s_{t+1} = f(s_t, a_t) \right) \right] \Rightarrow \bigwedge_{t=1}^{H} (s_t \in S_{\text{safe}}),$$

where $\vec{s} = (s_1, \ldots, s_H)$ are the free variables, and we use
$\equiv$ to distinguish equality of logical formulas from equality of variables
within a formula. Intuitively, this formula says that if (i) $s_1$ is an
initial state, and (ii) the actions are chosen by $\pi$ and the transitions by
$f$, then all states are safe.
(a) Recurrent region (b) Failure case


Fig. 8. (a) A recurrent region; proving that the ball always returns to this region 
implies that the policy plays correctly for an infinite horizon. (b) A failure case found 
by verification, where the paddle fails to keep the ball from exiting. (Color figure online) 


With this expression for $\phi(\pi)$, to prove safety, it suffices to prove
that $\neg\phi(\pi) \equiv \text{false}$. The latter equivalence is an instance
of Satisfiability Modulo Theories (SMT), and can automatically be checked by an
SMT solver [15] as long as predicates of the form $s \in S_{\text{safe}}$,
$a = \pi(s)$, and $s' = f(s, a)$ can be expressed in a theory that is supported
by the SMT solver. A standard setting is where $S_{\text{safe}}$ is a polytope,
and $\pi$ and $f$ are piecewise affine; in these cases, each of these
predicates can be expressed as conjunctions and disjunctions of linear
inequalities, which are typically supported (e.g., the problem can be reduced
to an integer program).
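
To make the structure of the safety check concrete, the sketch below searches
for a counterexample to $\phi(\pi)$ on a toy system. It deliberately swaps the
SMT solver for brute-force enumeration, which is only possible because the toy
state space is finite; the integer dynamics, the safe set, and both policies
are hypothetical and chosen purely for illustration.

```python
def find_counterexample(policy, f, initial_states, is_safe, horizon):
    """Search for a rollout s_1, ..., s_H that leaves the safe set.

    Returns the violating prefix of the rollout, or None if every rollout
    from every initial state stays safe for `horizon` steps.
    """
    for s1 in initial_states:
        trace, s = [], s1
        for _ in range(horizon):
            trace.append(s)
            if not is_safe(s):
                return trace
            s = f(s, policy(s))  # deterministic transition s' = f(s, a)
    return None

SAFE = range(-5, 6)        # S_safe = {-5, ..., 5}
INITIAL = range(-2, 3)     # S_1 = {-2, ..., 2}, a subset of S_safe

def step(s, a):
    return s + a

def centering(s):
    """Piecewise-constant policy that moves toward 0."""
    return -1 if s > 0 else (1 if s < 0 else 0)

def drifting(s):
    """Policy that always moves right and eventually leaves the safe set."""
    return 1

safe_result = find_counterexample(centering, step, INITIAL, lambda s: s in SAFE, 10)
unsafe_result = find_counterexample(drifting, step, INITIAL, lambda s: s in SAFE, 10)
```

As with an SMT solver, a failed proof comes with a concrete witness: the
returned trace plays the role of a satisfying assignment of the negated
formula and shows exactly how the policy leaves the safe set.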

As an example, this strategy has been used to verify that the decision tree
policy for a toy game of Pong shown in Fig. 1 in Sect. 3 is correct, i.e., that
it successfully blocks the ball from exiting. In this case, we can actually
prove correctness over an infinite horizon. Rather than prove that the ball
does not exit in $H$ steps, we instead prove that for any state $s_1$ where the
ball is in the top half of the screen (depicted in blue in Fig. 8a), the ball
returns to this region after $H$ steps. If this property is true, then the ball
never exits the screen.

For this property, the SMT solver initially identified a failure case where the
ball exits the screen, which is shown in Fig. 8b; in this corner case, the ball
is at the very edge of the screen, and the paddle fails to keep the ball from
exiting. This problem can be fixed by manually examining the decision tree and
modifying it to correctly handle the failure case; the modified decision tree
has been successfully proven to be correct, i.e., it always keeps the ball in
the screen.

In another example, we used bounded verification to verify that the state
machine policy in Fig. 2b does not result in any collisions for the parallel
parking task in Fig. 2a. We used dReach [29], an SMT solver designed to verify
safety for hybrid systems, which are dynamical systems that include both
continuous transitions (modeled using differential equations) and discrete,
discontinuous ones (modeled using a finite-state machine). In particular,
dReach performs bounded reachability analysis, where it unrolls the state
machine modes up to some bound. Furthermore, dReach is sound and
$\delta$-complete, i.e., if it says the system is safe, then it is guaranteed
to be safe, and if it says the system is unsafe, then there exists some
$\delta$-bounded perturbation that renders the system unsafe. Thus, we can vary
$\delta$ to quantify the robustness of the system to perturbations.

Fig. 9. A failure case found using our verification algorithm with tolerance
parameter $\delta = 0.24$ for the state machine policy in Fig. 2b on the
parallel parking task in Fig. 2a. Here, the car collides with the car in the
front.

With $\delta = 0.1$, dReach proved that the policy in Fig. 2b is indeed safe
for up to an unrolling of 7 modes of the state machine, which was enough for
the controller to complete the task from a significant fraction of the initial
state space. However, with $\delta = 0.24$, dReach identified a failure case
where the car would collide with the car in the front (under some perturbations
of the original model); this failure case is shown in Fig. 9. We manually fixed
this problem by inspecting the state machine policy in Fig. 2b and modifying
the switching conditions $G_{m_1}^{m_2}$ and $G_{m_2}^{m_1}$ to $d_f < 0.5$ and
$d_b < 0.5$, respectively. With these changes, dReach proved that the policy is
safe for $\delta = 0.24$.

More generally, similar strategies can be used to verify robustness and
stability of programmatic controllers [9,50]. They can also be extended to compute
regions of attraction—for instance, to show that a decision tree policy provably 
stabilizes a pendulum to the origin [39]. To improve performance, one strategy 
is to compose a provably safe programmatic policy with a higher performing 
but potentially unsafe DNN policy using shielding [1,7,8,32,51]; intuitively, this 
strategy uses the DNN policy as long as the programmatic policy can ensure 
safety. Finally, the techniques so far have focused on safety after training the 
policy; in some settings, it can be desirable to continue running reinforcement 
learning after deploying the policy to adapt to changing environments. To enable 
safety during learning, one strategy is to prove safety while accounting for uncer- 
tainty in the current model of the environment [3]. 
5.3 Robustness


Another advantage of programmatic policies is that they tend to be more robust 
than DNN policies—i.e., they generalize well to states outside of the distribution 
on which the policy was trained. For example, it has been shown that a program- 
matic policy trained to drive a car along one race track can generalize to other 
race tracks not seen during training, while DNN policies trained in the same 
way do not generalize as well [50]. We can formalize this notion by considering 
separate training and test distributions over tasks—e.g., the training distribu- 
tion over tasks might include driving on just a single race track, whereas the 
test distribution includes driving on a number of additional race tracks. Then, a 
policy is robust if it performs well on the test distribution over tasks even when 
it is trained on the training distribution of tasks. 

A special case is inductive generalization, where the tasks are indexed by
natural numbers $i \in \mathbb{N}$, the training distribution is over small $i$, and the test
distribution is over large i [26]. As a simple example, i may indicate the horizon 
over which the task is trained; then, a robust policy is one that is trained on 
short horizon tasks but generalizes to long horizon tasks. 

Consider again the parallel parking task from Fig. 2 in Sect. 3. For this task,
we can consider inductive generalization of a policy in terms of the number of
back-and-forth motions needed to solve the task [26]. In particular, Figs. 10a, 10b, 
and 10c depict training tasks with relatively few back-and-forth motions, and 
Fig. 10d depicts a test task with a much larger number of back-and-forth motions. 
As shown in Fig. 10e, a DNN policy trained using deep reinforcement learning can 
solve additional tasks from the training distribution; however, Fig. 10f shows that 
this policy does not generalize to tasks from the test distribution. In contrast, 
a state machine policy performs well on both additional tasks from the train- 
ing distribution (Fig. 10g) as well as tasks from the test distribution (Fig. 10h). 
Intuitively, the state machine policy is learning the correct back-and-forth
motion needed to solve the parallel parking problem. It can do so since (i) it is 
sufficiently expressive to represent the “correct” solution, yet (ii) it is sufficiently 
constrained that it learns a systematic policy. In contrast, the DNN policy can 
likely represent the correct solution, but because it is highly underconstrained, 
it finds an alternative solution that works on the training tasks, but does not 
generalize well to the test tasks. Thus, programmatic policies provide a promis- 
ing balance between expressiveness and structure needed to solve challenging 
control tasks in a generalizable way. 

For an illustration of these distinctions, we show the sequence of actions taken 
as a function of time by a programmatic policy compared to a DNN policy in 
Fig. 11. Here, the task is to fly a 2D quadcopter through an obstacle course 
by controlling its vertical acceleration. As can be seen, the state machine policy 
produces a smooth repeating pattern of actions; in contrast, the DNN policy acts 
highly erratically. This example further illustrates how programmatic policies
are both expressive (evidenced by the complexity of the red curve) and
structured (evidenced by the smoothness of the red curve and its repeating
pattern). In contrast, DNN policies are expressive (as evidenced by the
complexity of the blue curve), but lack the structure needed to generalize
robustly.
Fig. 10. (a, b, c) Training tasks for the autonomous driving problem in Fig. 2.
(d) Test task, which is harder due to the increased number of back-and-forth
motions required. (e) The trajectory taken by the DNN policy on a training
task. (f) The trajectory taken by the DNN policy on a test task; as can be
seen, it has several unsafe collisions. (g) The trajectory taken by the state
machine policy (SMP) on a training task. (h) The trajectory taken by the SMP on
a test task; as can be seen, it generalizes well to this task. Adapted from
[26].
Fig. 11. The vertical acceleration (i.e., action) selected by the policy as a
function of time, for both our programmatic policy (red) and a DNN policy
(blue), for a 2D quadcopter task. Adapted from [26]. (Color figure online)
6 Conclusions and Future Work


In this chapter, we have described an approach to reinforcement learning where
we train programmatic policies such as decision trees, state machine policies,
and list processing programs, instead of DNN policies. These policies can be
trained using algorithms based on imitation learning, which first train a DNN
policy using deep reinforcement learning and then train a programmatic policy
to imitate the DNN policy. This strategy reduces the reinforcement learning
problem to a supervised learning problem, which can be solved by existing
algorithms such as program synthesis. Through a number of case studies, we have
demonstrated that compared to DNN policies, programmatic policies are highly
interpretable, are easier to formally verify, and generalize more robustly.

We leave a number of directions for future work. One important challenge 
is that synthesizing programmatic policies remains costly. Many state-of-the-art 
program synthesis algorithms rely heavily on domain-specific pruning strategies 
to improve performance, including strategies targeted at machine learning appli- 
cations [12]. Leveraging these strategies can significantly increase the complexity 
of programmatic policies that can be learned in a tractable way. 

Another interesting challenge is scaling verification algorithms to more real- 
istic problems. The key limitation of existing approaches is that even if the pro- 
grammatic policy has a compact representation, the model of the environment 
often does not. A natural question in this direction is whether we can learn pro- 
grammatic models of the environment that are similarly easy to formally verify, 
while being a good approximation of the true environment. 

Finally, we have described one strategy for constructing neurosymbolic
policies that combine programs and DNNs, i.e., the neurosymbolic transformer. We
believe a number of additional kinds of model compositions may be feasible—for 
example, leveraging a neural network to detect objects and then using a program 
to reason about them, or using programs to perform high-level reasoning such 
as path planning while letting a DNN policy take care of low-level control. 
Abstract. Recent deep-learning models have achieved impressive predictive
performance by learning complex functions of many variables,
often at the cost of interpretability. This chapter covers recent work 
aiming to interpret models by attributing importance to features and 
feature groups for a single prediction. Importantly, the proposed attri- 
butions assign importance to interactions between features, in addition 
to features in isolation. These attributions are shown to yield insights 
across real-world domains, including bio-imaging, cosmology image and 
natural-language processing. We then show how these attributions can 
be used to directly improve the generalization of a neural network or to 
distill it into a simple model. Throughout the chapter, we emphasize the 
use of reality checks to scrutinize the proposed interpretation techniques. 
(Code for all methods in this chapter is available at github.com/csinva
and github.com/Yu-Group, implemented in PyTorch [54]).


Keywords: Interpretability · Interactions · Feature importance ·
Neural network · Distillation


1 Interpretability: For What and For Whom? 


Deep neural networks (DNNs) have recently received considerable attention
for their ability to accurately predict a wide variety of complex phenomena.
However, there is a growing realization that, in addition to predictions, DNNs
are capable of producing useful information (i.e. interpretations) about domain
relationships contained in data. More precisely, interpretable machine
learning can be defined as “the extraction of relevant knowledge from a
machine-learning model concerning relationships either contained in data or
learned by the model” [50].¹

C. Singh and W. Ha: equal contribution.

We gratefully acknowledge partial support from NSF TRIPODS Grant 1740855,
DMS-1613002, 1953191, 2015341, IIS 1741340, ONR grant N00014-17-1-2176, the
Center for Science of Information (CSoI), an NSF Science and Technology Center,
under grant agreement CCF-0939370, NSF grant 2023505 on Collaborative Research:
Foundations of Data Science Institute (FODSI), the NSF and the Simons
Foundation for the Collaboration on the Theoretical Foundations of Deep
Learning through awards DMS-2031883 and 814639, and a grant from the Weill
Neurohub.

© The Author(s) 2022

A. Holzinger et al. (Eds.): xxAI 2020, LNAI 13200, pp. 229-254, 2022.
https://doi.org/10.1007/978-3-031-04083-2_12


[Figure 1 diagram, three columns]
Interaction/transformation attributions: 2.1 Scoring interactions (CD); 2.2 Hierarchical interpretations (ACD); 2.3 Transformation importance (TRIM)
Improving models with interpretations: 3.1 Explanation regularization (CDEP); 3.2 Distillation with wavelets (AWD)
Real-world problems: 4.1 Molecular partner prediction; 4.2 Cosmological parameter prediction; 4.3 Skin cancer classification


Fig. 1. Chapter overview. We begin by defining interpretability and some of its desider- 
ata, following [50] (Sect. 1). We proceed to overview different methods for computing 
interpretations for interactions/transformations (Sect. 2), including for scoring interac- 
tions [49], generating hierarchical interpretations [68], and calculating importances for 
transformations of features [67]. Next, we show how these interpretations can be used 
to improve models (Sect. 3), including by directly regularizing interpretations [60] and 
distilling a model through interpretations [31]. Finally, we show how these interpreta- 
tions can be adapted to real-world applications (Sect. 4), including molecular partner 
prediction, cosmological parameter prediction, and skin-cancer classification. 


Here, we view knowledge as being relevant if it provides insight for a par- 
ticular audience into a chosen problem. This definition highlights that inter- 
pretability is poorly specified without the context of a particular audience and 
problem, and should be evaluated with the context in mind. This definition also 
implies that interpretable ML provides correct information (i.e. knowledge), and 
we use the term interpretation, assuming that the interpretation technique at
hand has passed some form of reality check (i.e. it faithfully captures some
notion of reality).

¹ We include different headings such as explainable AI (XAI), intelligible ML
and transparent ML under this definition.

Interpretations have found uses both in their own right, e.g. in medicine [41],
policy-making [11], and science [5,77], as well as in auditing predictions
themselves in response to issues such as regulatory pressure [29] and fairness
[22]. In these domains, interpretations have been shown to help with evaluating
a learned model, providing information to repair a model (if needed), and
building trust with domain experts [13]. However, this increasing role, along
with the explosion in proposed interpretation techniques
[4,27,31,50,53,75,81,84], has raised considerable concerns about the use of
interpretation methods in practice [2,30]. Furthermore, it is unclear how
interpretation techniques should be evaluated in the real-world context to
advance our understanding of a particular problem. To do so, we first review
some of the desiderata of interpretability, following [50] among many
definitions [19,40,63], then discuss some methods for critically evaluating
interpretations.


The PDR Desiderata for Interpretations. In general, it is unclear how to select 
and evaluate interpretation methods for a particular problem and audience. To 
help guide this process, we cover the PDR framework [50], consisting of three 
desiderata that should be used to select interpretation methods for a particu- 
lar problem: predictive accuracy, descriptive accuracy, and relevancy. Predictive 
accuracy measures the ability of a model to capture underlying relationships 
in the data (and generally includes different measures of a model’s quality of 
fit)—this can be seen as the most common form of reality check. In contrast, 
descriptive accuracy measures how well one can approximate what the model has 
learned using an interpretation method. Descriptive accuracy measures errors 
during the post-hoc analysis stage of modeling, when interpretation methods
are used to analyze a fitted model. For an interpretation to be trustworthy, one 
should try to maximize both of the accuracies. In cases where either accuracy 
is not very high, the resulting interpretations may still be useful. However, it is 
especially important to check their trustworthiness through external validation, 
such as running an additional experiment. Relevancy guides which interpreta- 
tion to select based on the context of the problem, often playing a key role in 
determining the trade-off between predictive and descriptive accuracy; however, 
predictive accuracy and relevancy do not always trade off against each other,
as the examples in Sect. 4 show.


Evaluating Interpretations and Additional Reality Checks. Techniques striving 
for interpretations can provide a large amount of fine-grained information, often 
not just for individual features but also for feature groups [49,68]. As such, it is 
important to ensure that this added information correctly reflects a model (i.e. 
has high descriptive accuracy), and can be useful in practice. This is challenging 
in general, but there are some promising directions. One direction, often used in 
statistical research including causal inference, uses simulation studies to evaluate 
interpretations. In this setting, a researcher defines a simple generative process, 
generates a large amount of data from that process, and trains their statistical 
or ML model on that data. Assuming a proper simulation setup, a sufficiently
relevant and powerful model to recover the generative process, and sufficiently 
large training data, the trained model should achieve near-perfect generalization 
accuracy. The practitioner then measures whether their interpretations recover 
aspects of the original generative process. If the simulation captures the reality 
well, then it can be viewed as a weaker form of reality check. 
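
The simulation-study recipe above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a procedure taken from the chapter: the linear generative process, the least-squares model, and the use of permutation importance as the interpretation under test are all hypothetical choices made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simple generative process (an assumption for illustration):
# y depends only on features 0 and 3; the other features are pure noise.
n, d = 5000, 6
true_coef = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0])
X = rng.normal(size=(n, d))
y = X @ true_coef + 0.1 * rng.normal(size=n)

# Fit a model on one half of the data; ordinary least squares stands in
# for "a statistical or ML model".
X_train, y_train = X[: n // 2], y[: n // 2]
X_test, y_test = X[n // 2 :], y[n // 2 :]
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def mse(X_eval):
    """Held-out mean squared error of the fitted model."""
    return float(np.mean((X_eval @ coef - y_test) ** 2))


# Interpretation under test: permutation importance, i.e. the increase in
# held-out error when a single feature column is shuffled.
baseline = mse(X_test)
importance = np.zeros(d)
for j in range(d):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importance[j] = mse(X_perm) - baseline

# Reality check: do the top-ranked features match the generative process?
recovered = set(np.argsort(importance)[-2:].tolist())
truth = set(np.flatnonzero(true_coef).tolist())
print("held-out MSE:", baseline)
print("recovered:", sorted(recovered), "truth:", sorted(truth))
```

The check is only as strong as the simulation is realistic: agreement here says that the interpretation recovers a known structure in an easy setting, not that it will do so on real data.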

Going a step further, interpretations can be tested by gathering new data in
follow-up experiments or observations for retrospective validation. Another
direction, which this chapter also focuses on, is to demonstrate the interpretations
through domain knowledge that is relevant to a particular domain or audience. To
do so, we closely collaborate with domain experts and showcase how interpretations
can inform relevant knowledge in fundamental problems in cosmology and
molecular-partner prediction. We highlight the use of reality checks to evaluate
each proposed method in the chapter.


Chapter Overview. A vast line of prior work has focused on assigning importance 
to individual features, such as pixels in an image or words in a document. Several 
methods yield feature-level importance for different architectures. They can be 
categorized as gradient-based [7,65,71,73], decomposition-based [6,51,66] and 
others [15,26,57,85], with many similarities among the methods [3,43]. While 
many methods have been developed to attribute importance to individual fea- 
tures of a model’s input, relatively little work has been devoted to understanding 
interactions between key features. These interactions are a crucial part of inter- 
preting modern deep-learning models, as they are what enable strong predictive 
performance on structured data. 

Here, we cover a line of work that aims to identify, attribute importance to,
and utilize interactions in neural networks for interpretation. We then explore
how these attributions can be used to help improve the performance of DNNs. 
Despite their strong predictive performance, DNNs sometimes latch onto spu- 
rious correlations caused by dataset bias or overfitting [79]. As a result, DNNs 
often exploit bias regarding gender, race, and other sensitive attributes present 
in training datasets [20,28,52]. Moreover, DNNs are extremely computationally 
intensive and difficult to audit. 

Figure 1 shows an overview of this chapter. We first overview different 
methods for computing interpretations (Sect. 2), including for scoring inter- 
actions [49], generating hierarchical interpretations [68], and calculating impor- 
tances for transformations of features [67]. Next, we show how these interpreta- 
tions can be used to improve models (Sect. 3), including by directly regularizing 
interpretations [60] and distilling a model through interpretations [31]. Finally, 
we show how these interpretations can be adapted to real-world problems (Sect. 
4), including molecular partner prediction, cosmological parameter prediction, 
and skin-cancer classification.


2 Computing Interpretations for Feature Interactions
and Transformations 


This section reviews three recent methods developed to extract the interactions 
between features that an (already trained) DNN has learned. First, Sect. 2.1 
shows how to compute importance scores for groups of features via contex- 
tual decomposition (CD), a method which works with LSTMs [49] and arbi- 
trary DNNs, such as CNNs [68]. Next, Sect. 2.2 covers agglomerative contextual 
decomposition (ACD), where a group-level importance measure, in this case CD, 
is used as a joining metric in an agglomerative clustering procedure. Finally, Sect. 
2.3 covers transformation importance (TRIM), which allows for computing scores 
for interactions on transformations of a model’s input. Other methods have been 
recently developed for understanding model interactions with varying degrees of 
computational cost and faithfulness to the trained model [17,18,75,76,78,83].


2.1 Contextual Decomposition (CD) Importance Scores for General 
DNNs 


Contextual decomposition breaks up the forward pass of a neural network in
order to find an importance score of some subset of the inputs for a particular
prediction. For a given DNN $f(x)$, its output is represented as a SoftMax operation
applied to logits $g(x)$. These logits, in turn, are the composition of $L$ layers
$g_i$, $i = 1, \dots, L$, such as convolutional operations or ReLU non-linearities:


$$f(x) = \text{SoftMax}(g(x)) = \text{SoftMax}(g_L(g_{L-1}(\cdots(g_2(g_1(x)))))). \tag{1}$$


Given a group of features $\{x_j\}_{j \in S}$, the CD algorithm, $g^{CD}(x)$, decomposes the
logits $g(x)$ into a sum of two terms, $\beta(x)$ and $\gamma(x)$. $\beta(x)$ is the importance
measure of the feature group $\{x_j\}_{j \in S}$, and $\gamma(x)$ captures contributions to $g(x)$
not included in $\beta(x)$.


$$g^{CD}(x) = (\beta(x), \gamma(x)), \tag{2}$$
$$\beta(x) + \gamma(x) = g(x). \tag{3}$$


Computing the CD decomposition for $g(x)$ requires layer-wise CD decompositions
$g_i^{CD}(x) = (\beta_i, \gamma_i)$ for each layer $g_i(x)$, where $g_i(x)$ represents the vector
of neural activations at the $i$-th layer. Here, $\beta_i$ corresponds to the importance
measure of $\{x_j\}_{j \in S}$ to layer $i$, and $\gamma_i$ corresponds to the contribution of the rest
of the input to layer $i$. Maintaining the decomposition requires $\beta_i + \gamma_i = g_i(x)$
for each $i$; the CD scores for the full network are computed by composing these
decompositions:


$$g^{CD}(x) = g_L^{CD}(g_{L-1}^{CD}(\cdots(g_2^{CD}(g_1^{CD}(x))))). \tag{4}$$

Note that, in the above equation, the CD algorithm $g^{CD}$ takes as input a
vector $x$ and, for each layer, outputs the pair of vector scores $g_i^{CD}(x) = (\beta_i, \gamma_i)$;
the final output is given by a pair of numbers $g^{CD}(x) = (\beta(x), \gamma(x))$ such
that the sum $\beta(x) + \gamma(x)$ equals the logits $g(x)$.

The initial CD work [49] introduced decompositions $g_i^{CD}$ for layers used in
LSTMs, and the follow-up work [68] did so for layers used in CNNs and more generic
deep architectures. Below, we give example decompositions for some commonly
used layers, such as convolutional layers, linear layers, and ReLU activations.

When $g_i$ is a convolutional or fully connected layer, the layer operation consists
of a weight matrix $W$ and a bias vector $b$. The weight matrix can be multiplied
with $\beta_{i-1}$ and $\gamma_{i-1}$ individually, but the bias must be partitioned between
the two. The bias is partitioned proportionally based on the absolute value of
the layer activations. For the convolutional layer, this equation yields only one
activation of the output; it must be repeated for each activation.


$$\beta_i = W\beta_{i-1} + \frac{|W\beta_{i-1}|}{|W\beta_{i-1}| + |W\gamma_{i-1}|} \cdot b, \tag{5}$$
$$\gamma_i = W\gamma_{i-1} + \frac{|W\gamma_{i-1}|}{|W\beta_{i-1}| + |W\gamma_{i-1}|} \cdot b. \tag{6}$$


Next, for the ReLU activation function,² the importance score $\beta_i$ is computed
as the activation of $\beta_{i-1}$ alone, and $\gamma_i$ is then updated by subtracting this from the
total activation.


$$\beta_i = \text{ReLU}(\beta_{i-1}), \tag{7}$$
$$\gamma_i = \text{ReLU}(\beta_{i-1} + \gamma_{i-1}) - \text{ReLU}(\beta_{i-1}). \tag{8}$$


For a dropout layer, dropout is simply applied to $\beta_{i-1}$ and $\gamma_{i-1}$ individually.
Computationally, a CD call is comparable to a forward pass through the network $f$.
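
The layer-wise rules in Eqs. (5)-(8) can be made concrete with a small NumPy sketch for a fully connected ReLU network. It is an illustration under assumptions rather than the reference implementation: the layer encoding, the function names, the initialization of the two parts by masking the input, and the choice to give the whole bias to the second part when both parts are zero are all hypothetical.

```python
import numpy as np


def cd_linear(beta, gamma, W, b):
    """Eqs. (5)-(6): apply W to each part and split the bias proportionally."""
    wb, wg = W @ beta, W @ gamma
    denom = np.abs(wb) + np.abs(wg)
    # Share of the bias given to beta. If both parts are zero, the whole bias
    # goes to gamma (an assumption; the equations leave this case undefined).
    frac = np.divide(np.abs(wb), denom, out=np.zeros_like(denom), where=denom > 0)
    return wb + frac * b, wg + (1.0 - frac) * b


def cd_relu(beta, gamma):
    """Eqs. (7)-(8): beta passes through ReLU alone, gamma takes the remainder."""
    new_beta = np.maximum(beta, 0.0)
    return new_beta, np.maximum(beta + gamma, 0.0) - new_beta


def cd_forward(x, mask, layers):
    """Eq. (4): compose the layer-wise decompositions from input to logits."""
    beta, gamma = x * mask, x * (1.0 - mask)  # feature group S vs. the rest
    for layer in layers:
        if layer[0] == "linear":
            beta, gamma = cd_linear(beta, gamma, layer[1], layer[2])
        else:
            beta, gamma = cd_relu(beta, gamma)
    return beta, gamma


def forward(x, layers):
    """Ordinary forward pass producing the logits g(x)."""
    a = x
    for layer in layers:
        a = layer[1] @ a + layer[2] if layer[0] == "linear" else np.maximum(a, 0.0)
    return a


rng = np.random.default_rng(0)
layers = [
    ("linear", rng.normal(size=(8, 5)), rng.normal(size=8)),
    ("relu",),
    ("linear", rng.normal(size=(3, 8)), rng.normal(size=3)),
]
x = rng.normal(size=5)
mask = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # S = {0, 1}

beta, gamma = cd_forward(x, mask, layers)
logits = forward(x, layers)
print("beta :", beta)
print("gamma:", gamma)
print("beta + gamma matches logits:", np.allclose(beta + gamma, logits))
```

The final check is the invariant from Eq. (3): whatever the feature group, the two parts must add up to the logits, which makes it a cheap sanity test for any new layer rule.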


Reality Check: Identifying Top-Scoring Phrases. When feasible, a com- 
mon means of scrutinizing what a model has learned is to inspect its most 
important features and interactions. Table 1 shows the ACD-top-scoring phrases 
of different lengths for an LSTM trained on SST (here the phrases are considered
from all sentences in SST’s validation set). These phrases were extracted by
running ACD separately on each sample in the validation set. The score of each
phrase was then computed by averaging the scores it received over all of its
occurrences in an ACD hierarchy. The extracted phrases are clearly reflective of the
corresponding sentiment, providing additional evidence that ACD is able to cap- 
ture meaningful positive and negative phrases. The paper [49] also shows that 
CD properly captures negation interactions for phrases. 


2 See [49, Sect. 3.2.2] for other activation functions such as sigmoid or hyperbolic
tangent.

Table 1. Top-scoring phrases of different lengths extracted by CD on SST’s validation
set. The positive/negative phrases identified by CD are all indeed positive/negative.


Length | Positive | Negative
1 | Pleasurable, glorious | Nowhere, grotesque, sleep
3 | Amazing accomplishment, great fun | Bleak and desperate, conspicuously lacks
5 | A pretty amazing accomplishment | Ultimately a pointless endeavour

2.2 Agglomerative Contextual Decomposition (ACD) 


Next, we cover agglomerative contextual decomposition (ACD), a general tech- 
nique that can be applied to a wide range of DNN architectures and data types. 
Given a prediction from a trained DNN, ACD produces a hierarchical clustering 
of the input features, along with the contribution of each cluster to the final 
prediction. This hierarchy is designed to identify clusters of features that the 
DNN learned are predictive. Throughout this subsection, we use the term CD 
interaction score between two groups of features to mean the difference between 
the scores of the combined group and the original groups. 

Given the generalized CD scores introduced above, we now introduce the 
clustering procedure used to produce ACD interpretations. At a high level, this 
method is equivalent to agglomerative hierarchical clustering, where the CD 
interaction score is used as the joining metric to determine which clusters to join 
at each step. This procedure builds the hierarchy by starting with individual 
features and iteratively combining them based on the highest interaction scores 
provided by CD. The displayed ACD interpretation is the hierarchy, along with 
the CD importance score at each node. 

The clustering procedure proceeds as follows. After initializing by computing 
the CD scores of each feature individually, the algorithm iteratively selects all 
groups of features within k% of the highest-scoring group (where k is a hyperpa- 
rameter) and adds them to the hierarchy. Each time a new group is added to the 
hierarchy, a corresponding set of candidate groups is generated by adding individual
contiguous features to the original group. For text, the candidate groups
correspond to adding one adjacent word onto the current phrase; for images, they
correspond to adding any adjacent pixel onto the current image patch. Candidate
groups are ranked according to the CD interaction score, which is the difference
between the score of the candidate and that of the original group.
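
The clustering loop can be sketched for a one-dimensional input such as a sentence. The version below is a deliberately simplified greedy variant and makes several assumptions: it pops a single best group per step instead of all groups within k% of the top score, it ranks by the signed interaction score, and `toy_scores` is a hand-written stand-in for CD scores of the phrase "not very good", not output from a real model. It returns the order in which groups are added rather than a finished tree.

```python
import heapq


def acd_1d(n_tokens, cd_score):
    """Return contiguous spans (lo, hi) in the order a greedy ACD-style loop adds them."""
    heap, seen, order = [], set(), []
    # Initialize with the CD score of every individual feature.
    for i in range(n_tokens):
        span = (i, i + 1)
        seen.add(span)
        heapq.heappush(heap, (-cd_score(span), span))
    while heap:
        _, (lo, hi) = heapq.heappop(heap)
        order.append(((lo, hi), cd_score((lo, hi))))
        # Candidate groups: extend the selected span by one adjacent token.
        for cand in ((lo - 1, hi), (lo, hi + 1)):
            if cand[0] < 0 or cand[1] > n_tokens or cand in seen:
                continue
            seen.add(cand)
            # Interaction score: candidate score minus the original group's score.
            interaction = cd_score(cand) - cd_score((lo, hi))
            heapq.heappush(heap, (-interaction, cand))
    return order


tokens = ["not", "very", "good"]
# Hypothetical CD scores for each contiguous span (start, end).
toy_scores = {
    (0, 1): -0.2, (1, 2): 0.1, (2, 3): 1.0,
    (0, 2): -0.3, (1, 3): 1.8, (0, 3): -1.5,
}

order = acd_1d(len(tokens), toy_scores.__getitem__)
for (lo, hi), score in order:
    print(" ".join(tokens[lo:hi]), score)
```

With these toy scores the loop first selects "good", then grows it into "very good" because that extension has the largest interaction, mirroring the hierarchy described for the toy example in the text.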


Reality Check: Human Experiment. Human experiments show that ACD 
allows users to better reason about the accuracy of DNNs. Each subject was 
asked to fill out a survey asking whether, using ACD, they could identify the 
more accurate of two models across three datasets (SST [70], MNIST [36] and 
ImageNet [16]), and ACD was compared against three baselines: CD [49], Integrated
Gradients (IG) [73], and occlusion [38,82]. Each model uses a standard
architecture that achieves high classification accuracy, and has an analogous
model with substantially poorer performance obtained by randomizing some
fraction of its weights while keeping the same predicted label. The objective of
this experiment was to determine if subjects could use a small number of interpretations
produced by ACD to identify the more accurate of the two models
(Fig. 2).


Fig. 2. ACD illustrated through the toy example of predicting the phrase “not very
good” as negative. Given the network and prediction, ACD constructs a hierarchy
of meaningful phrases and provides importance scores for each identified phrase. In
this example, ACD identifies that “very” modifies “good” to become the very positive
phrase “very good”, which is subsequently negated by “not” to produce the negative
phrase “not very good”.

For each question, 11 subjects were given interpretations from two different 
models (one high-performing and one with randomized weights), and asked to 
identify which of the two models had a higher generalization accuracy. To prevent 
subjects from simply selecting the model that predicts more accurately for the 
given example, for each question a subject is shown two sets of examples: one 
where only the first model predicts correctly and one where only the second 
model predicts correctly (although one model generalizes to new examples much 
better). 

Figure 3 shows the results of the survey. For SST, humans were better able to 
identify the strongly predictive model using ACD compared to other baselines, 
with only ACD and CD outperforming random selection (50%). Based on a one- 
sided two-sample t-test, the gaps between ACD and IG/Occlusion are significant, 
but not the gap between ACD and CD. In the simple setting of MNIST, ACD 
performs similarly to other methods. When applied to ImageNet, a more complex 
dataset, ACD substantially outperforms prior, non-hierarchical methods, and is 
the only method to outperform random chance. The paper [68] also contains 
results showing that the ACD hierarchy is robust to adversarial perturbations.


Fig. 3. Results for human studies. Binary accuracy for whether a subject correctly
selected the more accurate model using different interpretation techniques.


2.3 Transformation Importance with Applications to Cosmology 
(TRIM) 


Both CD and ACD show how to attribute importance to interactions between 
features. However, in many cases, raw features such as pixels in an image or words 
in a document may not be the most meaningful spaces to perform interpretation. 
When features are highly correlated or features in isolation are not semantically 
meaningful, the resulting attributions need to be improved. 

To meet this challenge, TRIM (Transformation Importance) attributes 
importance to transformations of the input features (see Fig. 4). This is crit- 
ical for making interpretations relevant to a particular audience/problem, as 
attributions in a domain-specific feature space (e.g. frequencies or principal com- 
ponents) can often be far more interpretable than attributions in the raw feature 
space (e.g. pixels or biological readings). Moreover, features after transformation 
can be more independent, semantically meaningful, and comparable across data 
points. The work here focuses on combining TRIM with CD, although TRIM 
can be combined with any local interpretation method. 


Fig. 4. TRIM: attributing importance to a transformation of an input $T_\theta(x)$ given a
model $f(x)$.

TRIM aims to interpret the prediction made by a model $f$ given a single
input $x$. The input $x$ is in some domain $\mathcal{X}$, but we desire an explanation for its
representation $s$ in a different domain $\mathcal{S}$, defined by a mapping $T : \mathcal{X} \to \mathcal{S}$, such
that $s = T(x)$. For example, if $x$ is an image, $s$ may be its Fourier representation,
and $T$ would be the Fourier transform. Notably, this process is entirely post-hoc:
the model $f$ is already fully trained on the domain $\mathcal{X}$. By reparametrizing the
network as shown in Fig. 4, we can obtain attributions in the domain $\mathcal{S}$. If we
require that the mapping $T$ be invertible, so that $x = T^{-1}(s)$, we can represent
each data point $x$ with its counterpart $s$ in the desired domain, and the function
to interpret becomes $f' = f \circ T^{-1}$; the function $f'$ can be interpreted with
any existing local interpretation method attr (e.g. LIME [57] or CD [49,68]).
Note that if the transformation $T$ is not perfectly invertible (i.e. $x \neq x'$, where
$x' = T^{-1}(s)$ is the reconstruction), then the residuals $x - x'$ may also be required
for local interpretation. For example, they are required for any gradient-based
attribution method to aid in computing $\partial f' / \partial s$.³ Once we have the
reparameterized function $f'(s)$, we need only specify which part of the input to
interpret, before calculating the TRIM score:


Definition 1. Given a model $f$, an input $x$, a mask $M$, a transformation $T$,
and an attribution method attr,


$$\text{TRIM}(s) = \text{attr}(f'; s), \quad \text{where } f' = f \circ T^{-1}, \; s = M \odot T(x).$$


Here $M$ is a mask used to specify which parts of the transformed space to interpret
and $\odot$ denotes elementwise multiplication.


In the work here, the choice of attribution method attr is CD, and
$\text{attr}(f; x', x)$ represents the CD score for the features $x'$ as part of the input $x$.
This formulation does not require that $x'$ simply be a binary masked version of
$x$; rather, the selection of the mask $M$ allows a human/domain scientist to decide
which transformed features to score. In the case of image classification, rather
than simply scoring a pixel, one may score the contribution of a frequency band
to the prediction $f(x)$. This general setup allows for attributing importance to a
wide array of transformations. For example, $T$ could be any invertible transform
(e.g. a wavelet transform), or a linear projection (e.g. onto a sparse dictionary).
Moreover, we can parameterize the transformation $T_\theta$ and learn the parameters
$\theta$ to produce a desirable representation (e.g. sparse or disentangled).
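
As an illustration of Definition 1 with a perfectly invertible transform, the sketch below scores a frequency band of a 1-D signal. It assumes that $T$ is NumPy's real FFT, that the mask $M$ is binary over frequency bins, and that the CD score of a band is obtained by propagating $x' = T^{-1}(M \odot T(x))$ and the remainder $x - x'$ through the CD rules for linear and ReLU layers. The network, the helper names, and the band choice are hypothetical.

```python
import numpy as np


def cd_scores(rel, irrel, layers):
    """CD propagation of a (relevant, irrelevant) split through linear/ReLU layers."""
    beta, gamma = rel, irrel
    for layer in layers:
        if layer[0] == "linear":
            _, W, b = layer
            wb, wg = W @ beta, W @ gamma
            denom = np.abs(wb) + np.abs(wg)
            frac = np.divide(np.abs(wb), denom, out=np.zeros_like(denom), where=denom > 0)
            beta, gamma = wb + frac * b, wg + (1.0 - frac) * b
        else:
            new_beta = np.maximum(beta, 0.0)
            beta, gamma = new_beta, np.maximum(beta + gamma, 0.0) - new_beta
    return beta, gamma


def trim_fourier(x, mask, layers):
    """TRIM score of a frequency band: CD of x' = T^{-1}(M * T(x)) within x."""
    s = mask * np.fft.rfft(x)            # s = M (elementwise) T(x)
    x_rel = np.fft.irfft(s, n=len(x))    # x' = T^{-1}(s)
    return cd_scores(x_rel, x - x_rel, layers)


def forward(x, layers):
    """Ordinary forward pass producing the logits of f."""
    a = x
    for layer in layers:
        a = layer[1] @ a + layer[2] if layer[0] == "linear" else np.maximum(a, 0.0)
    return a


rng = np.random.default_rng(0)
n = 32
layers = [
    ("linear", rng.normal(size=(16, n)), rng.normal(size=16)),
    ("relu",),
    ("linear", rng.normal(size=(2, 16)), rng.normal(size=2)),
]
x = rng.normal(size=n)

mask = np.zeros(n // 2 + 1)  # one entry per rfft frequency bin
mask[2:5] = 1.0              # score the band made of bins 2, 3 and 4

beta, gamma = trim_fourier(x, mask, layers)
logits = forward(x, layers)
print("band contribution per logit:", beta)
print("decomposition is exact:", np.allclose(beta + gamma, logits))
```

Because the inverse FFT is linear and has no bias term, splitting the signal in the input domain gives the same result as prepending $T^{-1}$ as an extra layer and running CD on $f' = f \circ T^{-1}$, which is why no explicit reparameterized network appears in the sketch.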

As a simple example, we investigate a text-classification setting using TRIM.
We train a 3-layer fully connected DNN with ReLU activations on the Kaggle
Fake News dataset,⁴ achieving a test accuracy of 94.8%. The model is trained
directly on a bag-of-words representation, but TRIM can provide a more succinct
space via a topic model transformation. The topic model is learned via latent
Dirichlet allocation [10], which provides an invertible linear mapping between a
document’s bag-of-words representation and its topic representation, where each
topic assigns different linear weights to each word. Figure 5 shows the mean
attributions for different topics when the model predicts Fake. Interestingly, the
topic with the highest mean attribution contains recognizable words such as
clinton and emails.


3 If the residual is not added, the gradient of $f' = f \circ T^{-1}$ requires $\partial f / \partial x|_{x'}$, which
can potentially cause evaluation of $f$ at the out-of-distribution examples $x' \neq x$.
4 https://www.kaggle.com/c/fake-news/overview.


[Figure 5: horizontal bar chart, one bar per topic; x-axis: Mean TRIM Score (CD),
with ticks from -250 to 1250. Topic rows, top to bottom:]
like just people time don know way life make good
people world black israel political state women war america students
news twitter com 2016 media facebook 2017 breitbart video [illegible]
russia united states government russian war foreign china president military
said police people state city syria attack officers killed military
said mr ms new la like york year city years
mr said trump president ms court new mrs campaign house
said percent new year company million 000 money years companies
trump president obama donald people house election said party white
clinton hillary election campaign fbi trump emails investigation comey email


Fig. 5. TRIM attributions for a fake-news classifier based on a topic model transforma- 
tion. Each row shows one topic, labeled with the top ten words in that topic. Higher 
attributions correspond to higher contribution to the class fake. Calculated over all 
points which were accurately classified as fake in the test set (4,160 points). 


Simulation. In the case of a perfectly invertible transformation, such as the 
Fourier transform, TRIM simply measures the ability of the underlying attribu- 
tion method (in this case CD) to correctly attribute importance in the trans- 
formed space. We run synthetic simulations showing the ability of TRIM with 
CD to recover known groundtruth feature importances. Features are generated 
i.i.d. from a standard normal distribution. Then, a binary classification outcome 
is defined by selecting a random frequency and testing whether that frequency 
is greater than its median value. Finally, we train a 3-layer fully connected DNN 
with ReLU activations on this task and then test the ability of different methods 
to assign this frequency the highest importance. Table 2 shows the percentage 
of errors made by different methods in such a setup. CD has the lowest error on 
average, compared to popular baselines. 


Table 2. Error (%) in recovering a groundtruth important frequency in simulated data 
using different attribution methods with TRIM, averaged over 500 simulated datasets. 


CD | DeepLift [66] | SHAP [43] | Integrated gradients [73]
0.4 ± 0.282 | 3.6 ± 0.833 | 4.0 ± 0.897 | 4.2 ± 0.876


3 Using Attributions to Improve Models


This section shows two methods for using the attributions introduced in Sect. 
2 to directly improve DNNs. Section 3.1 shows how CD scores can be penalized 
during training to improve generalization in interesting ways and Sect. 3.2 shows 
how attribution scores can be used to distill a DNN into a simple data-driven 
wavelet model. 


3.1 Penalizing Explanations to Align Neural Networks with Prior 
Knowledge (CDEP) 


While much work has been put into developing methods for explaining DNNs, 
relatively little work has explored the potential to use these explanations to help 
build a better model. Some recent work proposes forcing models to attend to 
certain regions [12,21,48], penalizing the gradients or expected gradients of a 
neural network [8,21,23,42,61,62], or using layer-wise relevance propagation to 
prune/improve models [72,80]. A newly emerging line of work investigates how 
domain experts can use explanations during the training loop to improve their 
models (e.g. [64]). 

Here, we cover contextual decomposition explanation penalization (CDEP), 
a method which leverages CD to enable the insertion of domain knowledge into 
a model [60]. Given prior knowledge in the form of importance scores, CDEP 
works by allowing the user to directly penalize importances of certain features 
or feature interactions. This forces the DNN to not only produce the correct 
prediction, but also the correct explanation for that prediction. CDEP can be
applied to arbitrary DNN architectures and is often orders of magnitude faster
and more memory efficient than recent gradient-based methods [23,62]. These
computational improvements arise because, unlike gradient-based attributions,
the CD score is computed along the forward pass: only first derivatives
are required for optimization, early layers can be frozen, and not all activations
of a DNN need to be cached to perform backpropagation. Furthermore,
training with gradient-based methods requires storing activations and gradients
for all layers of the network as well as the gradient with respect to the
input, whereas penalizing CD requires only a small constant amount of memory
more than standard training.

CDEP works by augmenting the traditional objective function used to train a
neural network with an additional component, as displayed in Eq. (9). In addition
to the standard prediction loss $\mathcal{L}$, which teaches the model to produce the correct
predictions by penalizing wrong predictions, we add an explanation error $\mathcal{L}_{\text{expl}}$,
which teaches the model to produce the correct explanations for its predictions
by penalizing wrong explanations. In place of the prediction and labels $f_\theta(X), y$
used in the prediction error $\mathcal{L}$, the explanation error $\mathcal{L}_{\text{expl}}$ uses the explanations
produced by an interpretation method $\text{expl}_\theta(X)$, along with targets provided by
the user $\text{expl}_X$. The two losses are weighted by a hyperparameter $\lambda \in \mathbb{R}$:


$$\hat{\theta} = \underset{\theta}{\text{argmin}} \; \underbrace{\mathcal{L}(f_\theta(X), y)}_{\text{Prediction error}} + \lambda \underbrace{\mathcal{L}_{\text{expl}}(\text{expl}_\theta(X), \text{expl}_X)}_{\text{Explanation error}} \tag{9}$$


CDEP uses CD as the explanation function used to compute $\text{expl}_\theta(X)$, allowing
the penalization of interactions between features. We now substitute the
above CD scores into the generic equation in Eq. (9) to arrive at CDEP as it is
used in this chapter. We collect from the user, for each input $x_i$, a collection of
feature groups $x_{i,S}$, $x_i \in \mathbb{R}^d$, $S \subseteq \{1, \dots, d\}$, along with explanation target values
$\text{expl}_{x_{i,S}}$, and use the $\|\cdot\|_1$ loss for $\mathcal{L}_{\text{expl}}$. This yields a vector $\beta(x_{i,S})$ for any subset
of features in an input $x_i$ which we would like to penalize. We can then collect
prior knowledge label explanations for this subset of features, $\text{expl}_{x_{i,S}}$, and use it
to regularize the explanation:


$$\hat{\theta} = \underset{\theta}{\text{argmin}} \; \underbrace{\sum_i \sum_c -y_{i,c} \log f_\theta(x_i)_c}_{\text{Prediction error}} + \lambda \underbrace{\sum_i \sum_S \|\beta(x_{i,S}) - \text{expl}_{x_{i,S}}\|_1}_{\text{Explanation error}} \tag{10}$$
i e i S 


In the above, i indexes each individual example in the dataset, S indexes a 
subset of the features for which we penalize their explanations, and c sums over 
each class. 
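
To make the two terms of Eq. (10) tangible, the following sketch evaluates the objective for fixed arrays. It is only an evaluation of the formula under assumptions: the array shapes, the function name `cdep_objective`, and the toy numbers are hypothetical, and in actual training the CD scores would be computed from the network so that the penalty can be differentiated with respect to its parameters.

```python
import numpy as np


def cdep_objective(probs, y_onehot, cd_scores, targets, lam):
    """Evaluate Eq. (10) for fixed arrays.

    probs:     (n, C) predicted class probabilities f_theta(x_i)
    y_onehot:  (n, C) one-hot labels y_i
    cd_scores: (n, G, C) CD scores beta(x_{i,S}) for G penalized groups per example
    targets:   (n, G, C) user-provided explanation targets expl_{x_{i,S}}
    lam:       weight of the explanation error
    """
    # Prediction error: cross-entropy summed over examples and classes.
    pred_loss = -np.sum(y_onehot * np.log(probs + 1e-12))
    # Explanation error: l1 distance between CD scores and their targets.
    expl_loss = np.sum(np.abs(cd_scores - targets))
    return pred_loss + lam * expl_loss, pred_loss, expl_loss


probs = np.array([[0.7, 0.3], [0.2, 0.8]])
y_onehot = np.array([[1.0, 0.0], [0.0, 1.0]])
# One penalized feature group per example, e.g. a single pixel whose
# contribution should be zero (so the targets are all zeros).
cd_scores = np.array([[[0.5, -0.5]], [[0.0, 1.0]]])
targets = np.zeros_like(cd_scores)

total, pred_loss, expl_loss = cdep_objective(probs, y_onehot, cd_scores, targets, lam=10.0)
print("prediction error :", pred_loss)
print("explanation error:", expl_loss)
print("total objective  :", total)
```

Setting the targets to zero corresponds to asking the model to ignore the selected groups; other targets encode prior knowledge about how much a group should contribute.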

The choice of prior knowledge explanations $\text{expl}_X$ is dependent on the application
and the existing domain knowledge. CDEP allows for penalizing arbitrary
interactions between features, allowing the incorporation of a very broad set of 
domain knowledge. In the simplest setting, practitioners may precisely provide 
prior knowledge human explanations for each data point. To avoid assigning 
human labels, one may utilize programmatic rules to identify and assign prior 
knowledge importance to regions, which are then used to help the model iden- 
tify important/unimportant regions. In a more general case, one may specify 
importances of different feature interactions. 


Towards Reality Check: ColorMNIST Task. Here, we highlight CDEP’s 
ability to alter which features a DNN uses to perform digit classification. Similar 
to one previous study [39], we alter the MNIST dataset to include three color 
channels and assign each class a distinct color, as shown in Fig. 6. An unpenalized 
DNN trained on this biased data will completely misclassify a test set with 
inverted colors, dropping to 0% accuracy (see Table 3), suggesting that it learns 
to classify using the colors of the digits rather than their shape. 

Interestingly, this task can be approached by minimizing the contribution of 
pixels in isolation (which only represent color) while maximizing the importance 
of groups of pixels (which can represent shapes). To do this, CDEP penalizes the 
CD contribution of sampled single-pixel values, following Eq. (10). Minimizing 
the contribution of single pixels encourages the DNN to focus instead on groups 
of pixels. Table 3 shows that CDEP can partially divert the network’s focus on
color to also focus on digit shape. The table includes 2 baselines: penalization
of the squared gradients (RRR) [62] and Expected Gradients (EG) [23]. The
baselines do not improve the test accuracy of the model on this task above the
random baseline, while CDEP significantly improves the accuracy to 31.0%.


Fig. 6. ColorMNIST: the shapes remain the same between the training set and the
test set, but the colors are inverted. (Color figure online)


Table 3. Test accuracy (%) on ColorMNIST. CDEP is the only method that captures and
removes color bias. All values averaged over thirty runs. Predicting at random yields
a test accuracy of 10%.


Dataset | Vanilla | CDEP | RRR | Expected gradients
ColorMNIST | 0.2 ± 0.2 | 31.0 ± 2.3 | 0.2 ± 0.1 | 10.0 ± 0.1


The paper [60] further shows how CDEP can be applied to diverse applica- 
tions, such as notions of fairness in the COMPAS dataset [35] and in natural- 
language processing. 


3.2 Distilling Adaptive Wavelets from Neural Networks with 
Interpretations 


One promising approach to acquiring highly predictive interpretable models is 
model distillation. Model distillation is a technique which distills the knowledge 
in one model into another model. Here, we focus on the case where we distill 
a DNN into a simple wavelet model. Wavelets have many useful properties,
including fast computation, an orthonormal basis, and interpretation in both 
spatial and frequency domains [44]. Here, we cover adaptive wavelet distillation 
(AWD), a method to learn a valid wavelet by distilling information from a trained 
DNN [31]. 

Equation (11) shows the three terms in the formulation of the method. $x_i$
represents the $i$-th input signal, $\hat{x}_i$ represents the reconstruction of $x_i$, $h$ and $g$
represent the lowpass and highpass wavelet filters, and $\Psi x_i$ denotes the wavelet
coefficients of $x_i$. $\lambda$ is a hyperparameter penalizing the sparsity of the wavelet
coefficients, which can help to learn a compact representation of the input signal,
and $\gamma$ is a hyperparameter controlling the strength of the interpretation loss,
which controls how much to use the information coming from a trained model $f$:


$$\text{minimize} \; \mathcal{L}(h, g) = \underbrace{\frac{1}{m} \sum_i \|x_i - \hat{x}_i\|_2^2}_{\text{Reconstruction loss}} + \underbrace{\frac{1}{m} \sum_i W(h, g, x_i; \lambda)}_{\text{Wavelet loss}} + \underbrace{\gamma \sum_i \|\text{TRIM}_f(\Psi x_i)\|_1}_{\text{Interpretation loss}} \tag{11}$$

Here the reconstruction loss ensures that the wavelet transform is invertible, 
allowing for reconstruction of the original data. Hence the transform does not 
lose any information in the input data. 

The wavelet loss ensures that the learned filters yield a valid wavelet transform.
Specifically, [45,47] characterize the sufficient and necessary conditions on
$h$ and $g$ to build an orthogonal wavelet basis. Roughly speaking, these conditions
state that in the frequency domain the mass of the lowpass filter $h$ is concentrated
on the range of low frequencies while the highpass filter $g$ contains more mass in
the high frequencies. We also desire the learned wavelet to provide sparse representations,
so we add the $\ell_1$ norm penalty on the wavelet coefficients. Combining
all these conditions via regularization terms, we define the wavelet loss at the
data point $x_i$ as


$$W(h, g, x_i; \lambda) = \lambda \|\Psi x_i\|_1 + \Big(\sum_n h[n] - \sqrt{2}\Big)^2 + \Big(\sum_n g[n]\Big)^2 + \big(\|h\|_2^2 - 1\big)^2$$
$$\qquad + \sum_w \big(|\hat{h}(w)|^2 + |\hat{h}(w + \pi)|^2 - 2\big)^2 + \sum_k \Big(\sum_n h[n]\, h[n - 2k] - \mathbb{1}_{k=0}\Big)^2,$$


where $\hat{h}$ denotes the Fourier transform of $h$, $g$ is set as $g[n] = (-1)^n h[N - 1 - n]$,
and $N$ is the support size of
$h$ (see [31] for further details on the formulations of wavelet loss).
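
The filter-dependent terms of the wavelet loss can be checked numerically on filters known to define orthogonal wavelets. The sketch below is an illustration under assumptions: it omits the sparsity term (which needs the full transform of a signal), evaluates the frequency condition on a hypothetical grid of 64 frequencies, and uses the Haar and Daubechies-2 lowpass filters as test cases; the function name and dictionary keys are made up for this example.

```python
import numpy as np


def wavelet_constraint_terms(h, n_freq=64):
    """Penalty terms on a lowpass filter h, excluding the sparsity term."""
    h = np.asarray(h, dtype=float)
    N = len(h)
    n = np.arange(N)
    g = (-1.0) ** n * h[::-1]  # g[n] = (-1)^n h[N - 1 - n]

    def h_hat(w):
        # Discrete-time Fourier transform of h evaluated at frequencies w.
        return np.exp(-1j * np.outer(w, n)) @ h

    w = np.linspace(0.0, 2.0 * np.pi, n_freq, endpoint=False)
    autocorr = np.correlate(h, h, mode="full")  # lags -(N-1), ..., N-1
    lags = np.arange(-(N - 1), N)
    even = lags % 2 == 0                        # shifts by 2k
    target = (lags[even] == 0).astype(float)    # indicator of k = 0
    return {
        "lowpass_sum": (h.sum() - np.sqrt(2.0)) ** 2,
        "highpass_sum": g.sum() ** 2,
        "unit_norm": (np.sum(h ** 2) - 1.0) ** 2,
        "frequency": np.sum(
            (np.abs(h_hat(w)) ** 2 + np.abs(h_hat(w + np.pi)) ** 2 - 2.0) ** 2
        ),
        "shift_orthogonality": np.sum((autocorr[even] - target) ** 2),
    }


haar = np.array([1.0, 1.0]) / np.sqrt(2.0)
s3 = np.sqrt(3.0)
db2 = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4.0 * np.sqrt(2.0))
not_a_wavelet = np.array([0.5, 0.5])  # averaging filter with the wrong scaling

haar_total = sum(wavelet_constraint_terms(haar).values())
db2_total = sum(wavelet_constraint_terms(db2).values())
bad_total = sum(wavelet_constraint_terms(not_a_wavelet).values())
print("Haar:", haar_total)
print("Daubechies-2:", db2_total)
print("averaging filter:", bad_total)
```

Valid orthogonal filters drive every term to (numerically) zero, while an arbitrary filter is penalized, which is how the loss steers learned filters toward a valid wavelet during optimization.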

Finally, the interpretation loss enables the distillation of knowledge from the
pre-trained model $f$ into the wavelet model. It ensures that attributions in the
space of wavelet coefficients $\Psi x_i$ are sparse, where the attributions of wavelet
coefficients are calculated by TRIM, as described in Sect. 2.3. This forces the
wavelet transform to produce representations that concisely explain the model’s
predictions at different scales and locations.

A key difference between AWD and existing adaptive wavelet techniques
(e.g. [55,56]) is that AWD uses interpretations from a trained model to learn the
wavelets; this incorporates information not just about the signal but also an
outcome of interest and the inductive biases learned by a DNN. This can help
learn an interpretable representation that is well-suited to efficient computation
and effective prediction.


Reality Check: Molecular Partner Prediction. For evaluation, see Sect. 
4.1, which shows an example of how a distilled AWD model can provide a simpler, 
more interpretable model while improving prediction accuracy.


4 Real-Data Problems Showcasing Interpretations


In this section, we focus on three real-data problems where the methods intro- 
duced in Sect. 2 and Sect. 3 are able to provide useful interpretations in context. 
Sect. 4.1 describes how AWD can distill DNNs used in cell biology, Sect. 4.2 
describes how TRIM + CD yield insights in a cosmological context, and Sect. 
4.3 describes how CDEP can be used to ignore spurious correlations in a medical 
imaging task. 


4.1 Molecular Partner Prediction 


We now turn our attention to a crucial question in cell biology: understanding
clathrin-mediated endocytosis (CME) [32,34]. It is the primary pathway
by which things are transported into the cell, making it essential to the functions
of higher eukaryotic life [46]. Many questions about this process remain unanswered,
prompting a line of studies aiming to better understand it [33].
One major challenge in analyzing CME is readily distinguishing
between abortive coats (ACs) and successful clathrin-coated pits (CCPs). Doing
so enables an understanding of what mechanisms allow for successful endocytosis.
This is a challenging problem where DNNs have recently been shown to
outperform classical statistical and ML methods.

Figure 7 shows the pipeline for this challenging problem. Tracking algorithms
run on videos of cells identify time-series traces of endocytic events. An LSTM
model learns to classify which endocytic events are successful, and CD scores
identify which parts of the traces the model uses. Using these CD scores, domain
experts are able to validate that the model does, in fact, use reasonable
features such as the max value of the time-series traces and the length of the
trace.


[Figure 7 panels, left to right: Videos of cells; Extracted traces; LSTM model; Interpretation (CD score); Distilled wavelet model]


Fig. 7. Molecular partner prediction pipeline. (A) Tracking algorithms run on videos 
of cells identify (B) time-series traces of endocytic events. (C) An LSTM model learns 
to classify which endocytic events are successful and (D) CD scores identify which 
parts of the traces the model uses. (E) AWD distills the LSTM model into a simple 
wavelet model which is able to obtain strong predictive performance. 


However, the LSTM model is still relatively difficult to understand and
computationally intensive. To create an extremely transparent model, we extract
only the 6 largest wavelet coefficients at each scale. Because only the largest
coefficients are kept, these features are expected to be invariant to the
specific locations where a CME event occurs in the input data. This results in a
final model with 30 coefficients (6 wavelet coefficients at 5 scales). These
wavelet coefficients are used to train a linear model, and the best
hyperparameters are selected via cross-validation on the training set. Figure 7
shows the best learned wavelet (for one particular run) extracted by AWD,
corresponding to the hyperparameter setting $\lambda = 0.005$ and
$\gamma = 0.043$. Table 4 compares the results for AWD to the original LSTM and
the initialized, non-adaptive DB5 wavelet model, where performance is measured
via a standard $R^2$ score, the proportion of variance in the response that is
explained by the model. The AWD model not only closes the gap between the
standard wavelet model (DB5) and the neural network, it also considerably
improves on the LSTM's performance (a 10% relative increase in $R^2$ score).
Moreover, we calculate the compression rates of the AWD wavelet and DB5; these
rates measure the proportion of wavelet coefficients in the test set for which
the magnitude and the attributions are both above $10^{-3}$. The AWD wavelet
exhibits much better compression than DB5 (an 18% reduction), showing the
ability of AWD to simultaneously provide sparse representations and explain the
LSTM's predictions concisely. The AWD model also dramatically decreases the
computation time at test time, a more than 200-fold reduction compared to the
LSTM.
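
The feature-extraction step described above (keep a fixed number of the largest coefficients at each scale) can be sketched as follows. This is only an illustration under stated assumptions: a plain Haar transform stands in for the learned AWD wavelet, the input is a synthetic trace rather than CME data, and the names `haar_step` and `top_k_features` are hypothetical.

```python
import math
import random


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def top_k_features(signal, n_scales=5, k=6):
    """Return the k largest detail coefficients at each of n_scales scales."""
    features = []
    approx = list(signal)
    for _ in range(n_scales):
        approx, detail = haar_step(approx)
        # Sorting discards positions, so the features do not record
        # where in the trace a large coefficient occurred.
        features.extend(sorted(detail, reverse=True)[:k])
    return features


random.seed(0)
# Synthetic stand-in for a fluorescence trace (assumption, not real data).
trace = [random.gauss(0.0, 1.0) for _ in range(256)]
feats = top_k_features(trace)
print(len(feats))  # 6 coefficients at each of 5 scales
```

The resulting fixed-length vector is what a linear model would then be fit on; the fitting step itself is omitted here.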

In addition to improving prediction accuracy, AWD enables domain experts 
to vet their experimental pipelines by making them more transparent. By 
inspecting the learned wavelet, AWD allows for checking what clathrin signa- 
tures signal a successful CME event; it indicates that the distilled wavelet aims 
to identify a large buildup in clathrin fluorescence (corresponding to the build- 
ing of a clathrin-coated pit) followed by a sharp drop in clathrin fluorescence 
(corresponding to the rapid deconstruction of the pit). This domain knowledge 
is extracted from the pre-trained LSTM model by AWD using only the saliency 
interpretations in the wavelet space. 


Table 4. Performance comparisons for different models in molecular-partner
prediction. AWD substantially improves predictive accuracy, compression rate,
and computation time on the test set. A higher $R^2$ score, a lower compression
factor, and a lower computation time indicate better results. For AWD, values
are averaged over 5 different random seeds.


                         | AWD (Ours)    | Standard wavelet (DB5) | LSTM
Regression ($R^2$ score) | 0.262 (0.001) | 0.197                  | 0.237
Compression factor       | 0.574 (0.010) | 0.704                  | N/A
Computation time         | 0.0002 s      | 0.0002 s               | 0.0449 s
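
As a quick check, the relative figures quoted in the text (roughly 10% higher $R^2$ than the LSTM, roughly 18% lower compression factor than DB5, and a more than 200-fold speed-up) can be recomputed from the entries of Table 4. The variable names below are illustrative only.

```python
# Values copied from Table 4.
r2_awd, r2_lstm = 0.262, 0.237
comp_awd, comp_db5 = 0.574, 0.704
time_awd, time_lstm = 0.0002, 0.0449  # seconds

r2_gain = (r2_awd - r2_lstm) / r2_lstm        # relative R^2 increase over the LSTM
comp_drop = (comp_db5 - comp_awd) / comp_db5  # relative reduction versus DB5
speedup = time_lstm / time_awd                # ratio of test-time computation

print(f"R^2 gain: {r2_gain:.3f}, compression drop: {comp_drop:.3f}, speed-up: {speedup:.1f}x")
```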


To see the effect of the interpretation loss on learning the wavelet transforms
and on the increased performance, we also learn the wavelet transform while
setting the interpretation loss to zero. In this case, the best regression
$R^2$ score selected via cross-validation is 0.231; the adaptive wavelets
without the interpretation loss still outperform the baseline wavelet but fail
to outperform the neural network models.


4.2 Cosmological Parameter Prediction 


We now turn to a cosmology example, where attributing importance to trans- 
formations helps understand cosmological models in a more meaningful feature 
space. Specifically, we consider weak gravitational lensing convergence maps, i.e. 
maps of the mass distribution in the Universe integrated up to a certain distance 
from the observer. In a cosmological experiment (e.g. a galaxy survey), these 
mass maps are obtained by measuring the distortion of distant galaxies caused 
by the deflection of light by the mass between the galaxy and the observer [9]. 
These maps contain a wealth of physical information of interest to cosmologists, 
such as the total matter density in the universe, $\Omega_m$. Current research aims at
identifying the most informative features in these maps for inferring the true 
cosmological parameters, with DNN-based inference methods often obtaining 
state-of-the-art results [25,58,59]. 

In this context, it is important to not only have a DNN that predicts well, but 
also understand what it learns. Knowing which features are important provides 
deeper understanding and can be used to design optimal experiments or analysis 
methods. Moreover, because this DNN is trained on numerical simulations (real- 
izations of the Universe with different cosmological parameters), it is important 
to validate that it uses physical features rather than latching on to numerical 
artifacts in the simulations. TRIM can help understand and validate that the
DNN learns appropriate physical features by attributing importance in the
spectral domain.

A DNN is trained to accurately predict $\Omega_m$ from simulated weak
gravitational lensing convergence maps (full details in [67]). To understand
what features the model is using, we desire an interpretation in the space of
the power spectrum. The images in Fig. 8 show how different information is
contained within different frequency bands in the mass maps. The plot in Fig. 8
shows the TRIM attributions with CD (normalized by the predicted value) for
different frequency bands when predicting the parameter $\Omega_m$.
Interestingly, the most important frequency band for the predictions seems to
peak at scales around $\ell = 10^4$ and then decay for higher frequencies.⁵ A
physical interpretation of this result is that the DNN concentrates on the most
discriminative part of the power spectrum, i.e. at scales large enough not to
be dominated by sample variance, and smaller than the frequency cutoff at which
the simulations lose power due to resolution effects.

Figure 9 shows some of the curves from Fig. 8 separated based on their
cosmology, to show how the curves vary with the value of $\Omega_m$. Increasing
the value of $\Omega_m$ increases the contribution of scales close to
$\ell = 10^4$, making other frequencies relatively unimportant. This is
consistent with known cosmology, as these scales appear to correspond to galaxy
clusters in the mass maps, which are structures very sensitive to the value of
$\Omega_m$. The fact that the importance of these features varies with
$\Omega_m$ would seem to indicate that at lower $\Omega_m$, the model is using a
different source of information, not located at any single scale, for making
its prediction.


5 Here the unit of frequency used is angular multipole $\ell$.


Fig. 8. Different scales (i.e. frequency bands) contribute differently to the prediction of
$\Omega_m$. Each blue line corresponds to one testing image and the red line shows the mean.
Images show the features present at different scales. The bandwidth is $\Delta\ell = 2{,}700$.
(Color figure online)


4.3 Improving Skin Cancer Classification via CDEP 


In recent years, deep learning has achieved impressive results in diagnosing skin 
cancer [24]. However, the datasets used to train these models often include spuri- 
ous features which make it possible to attain high test accuracy without learning 
the underlying phenomena [79]. In particular, a popular dataset from ISIC (Inter- 
national Skin Imaging Collaboration) has colorful patches present in approxi- 
mately 50% of the non-cancerous images but not in the cancerous images as can 
be seen in Fig. 10 [14]. We use CDEP to remedy this problem by penalizing the
DNN for placing importance on the patches during training.


Fig. 9. TRIM attributions vary with the value of $\Omega_m$.


Fig. 10. Example images from the ISIC dataset. Half of the benign lesion images
include a patch in the image. Training on this data results in the neural network overly
relying on the patches to classify images; CDEP avoids this.


The task in this section is to classify whether an image of a skin lesion 
contains (1) benign melanoma or (2) malignant melanoma. In a real-life task, 
this would for example be done to determine whether a biopsy should be taken. 
In order to identify the spurious patches, binary maps of the patches for the skin 
cancer task are segmented using SLIC, a common image-segmentation algorithm 
[1]. After the spurious patches are identified, they are penalized using CDEP
to have zero importance.

Table 5 shows results comparing the performance of a DNN trained with
and without CDEP. We report results on two variants of the test set. The first,
which we refer to as “no patches”, contains only images of the test set that do
not include patches. The second also includes images with those patches.
Training with CDEP improves the AUC and F1-score for both test sets, compared
to both a vanilla DNN and the RRR method introduced in [62]. Further visual
inspection shows that the DNN attributes low importance to regions in the
images with patches.


Table 5. Results from training a DNN on ISIC to recognize skin cancer (averaged over 
three runs). Results shown for the entire test set and for only the test-set images that 
do not include patches (“no patches”). The network trained with CDEP generalizes 
better, getting higher AUC and F1 on both. 


        | AUC (no patches) | F1 (no patches) | AUC (all) | F1 (all)
Vanilla | 0.93             | 0.67            | 0.96      | 0.67
RRR     | 0.76             | 0.45            | 0.87      | 0.45
CDEP    | 0.95             | 0.73            | 0.97      | 0.73


5 Discussion 


Overall, the interpretation methods here are shown to (1) accurately recover 
known importances for features/feature interactions [49], (2) correctly inform 
human decision-making and be robust to adversarial perturbations [68], and 
(3) reliably alter a neural network’s predictions when regularized appropriately 
[60]. For each case, we demonstrated the use of reality checks through predictive 
accuracy (the most common form of reality check) or through domain knowledge 
which is relevant to a particular domain/audience. 

There is considerable future work to do in developing and evaluating
attributions, particularly in distilling/building interpretable models for
real-world domains and in understanding how to make interpretation methods more
useful. We discuss each in turn below.


5.1 Building/Distilling Accurate and Interpretable Models 


In the ideal case, a practitioner can develop a simple model to make their pre- 
dictions, ensuring interpretability by obviating the need for post-hoc interpreta- 
tion. Interpretable models tend to be faster, more computationally efficient, and 
smaller than their DNN counterparts. Moreover, interpretable models allow for 
easier inspection of knowledge extracted from the learned models and make real- 
ity checks more transparent. AWD [31] represents one effort to use attributions 
to distill DNNs into an interpretable wavelet model, but the general idea can 
go much further. There are a variety of interpretable models, such as rule-based
models [37,69,74] or additive models [13], whose fitting process could benefit
from accurate attributions. Moreover, AWD and related techniques could be
extended beyond the current setting to unsupervised/reinforcement learning
settings or to incorporate multiple layers. Alternatively, attributions can be
used as feature engineering tools, to help build simpler, more interpretable
models. More useful features can help enable better exploratory data analysis,
unsupervised learning, or reality checks.


5.2 Making Interpretations Useful 


Furthermore, there is much work remaining to improve the relevancy of inter- 
pretations for a particular audience/problem. Given the abundance of possible 
interpretations, it is particularly easy for researchers to propose novel methods 
which do not truly solve any real-world problems or fail to faithfully capture 
some aspects of reality. A strong technique to avoid this is to directly test newly 
introduced methods in solving a domain problem. Here, we discussed several
real-data problems that have benefited from improved interpretations (Sect. 4),
ranging from cosmology to cell biology. In instances like these, where
interpretations are used directly to solve a domain problem, their relevancy is
indisputable and reality checks can be validated through domain knowledge. A
second, less
direct, approach is the use of human studies where humans are asked to perform 
tasks, such as evaluating how much they trust a model’s predictions [68]. While 
challenging to properly construct and perform, these studies are vital to demon- 
strating that new interpretation methods are, in fact, relevant to any potential 
practitioners. We hope the plethora of open problems in various domains such as 
science, medicine, and public policy can help guide and benefit from improved 
interpretability going forward. 


Abstract. Increased explainability in machine learning is traditionally
associated with lower performance, e.g. a decision tree is more explainable, but
less accurate than a deep neural network. We argue that, in fact, increasing the
explainability of a deep classifier can improve its generalization. In this
chapter, we survey a line of our published work that demonstrates how spatial
and spatiotemporal visual explainability can be obtained, and how such
explainability can be used to train models that generalize better on unseen
in-domain and out-of-domain samples, refine fine-grained classification
predictions, better utilize network capacity, and are more robust to network
compression.


Keywords: Explainability - Interpretability - Deep learning - Saliency 


1 Introduction 


Deep learning is now widely used in state-of-the-art Artificial Intelligence (AI) tech- 
nology. A Deep Neural Network (DNN) model however is, thus far, a “black box.” AI 
applications in finance, medicine, and autonomous vehicles demand justifiable predic- 
tions, barring most deep learning methods from use. Understanding what is going on 
inside the “black box” of a DNN, what the model has learned, and how the training 
data influenced that learning are all instrumental as AI serves humans and should be 
accountable to humans and society. 

In response, Explainable AI (XAI) has popularized a series of visual
explanations called saliency methods, which highlight pixels that are
“important” for a model's final prediction. We contribute multiple works that
target understanding deep model behavior through the analysis of saliency maps
that highlight regions of evidence used by the model. We then contribute works
that utilize such saliency to obtain models that have improved accuracy,
network utilization, robustness, and domain generalization. In this work, we
provide an overview of our contributions in this field.


XAI in Visual Data. Grounding model decisions in visual data has the benefit of being 
clearly interpretable by humans. The evidence that a deep convolutional model
uses in computing the class conditional probability for a specific class is
highlighted in the form of a saliency map. In our work [36], we present
applications of spatial grounding
in model interpretation, data annotation assistance for facial expression analysis and 
medical imaging tasks, and as a diagnostic tool for model misclassifications. We do so 
in a discriminative way that highlights evidence for every possible outcome given the 
same input for any deep convolutional neural network classifier. 

We also propose the black-box grounding techniques RISE [22] and D-RISE [23].
Unlike the majority of previous approaches, RISE can produce saliency maps
without access to the internal states of the base model, such as weights,
gradients or feature maps. The advantages of such a black-box approach are that
RISE does not assume any specifics about the base model architecture, it can be
used to test proprietary models that do not allow full access, and the
implementation is very easily adapted to a new base model. The saliency is
computed by perturbing the input image using a set of randomized masks while
keeping track of the changes in the output. Major changes in the output are
reflected in increased saliency of the perturbed region of the input, see Fig. 2.

Deep recurrent models are state-of-the-art for many vision tasks including video 
action recognition and video captioning. Models are trained to caption or classify activ- 
ity in videos, but little is known about the evidence used to make such decisions. Our 
work was the first to formulate top-down saliency in deep recurrent models for space- 
time grounding of videos [1]. We do so using a single contrastive backward pass of an 
already trained model. This enables the visualization of spatiotemporal cues that con- 
tribute to a deep model’s classification/captioning output and localization of segments 
within a video that correspond with a specific action, or phrase from a caption, without 
explicitly optimizing/training for these tasks. 


XAI for Improved Models. We propose three frameworks that utilize explanations 
to improve model accuracy. The first proposes a guided dropout regularizer for deep 
networks [39] based on the explanation of a network prediction defined as the firing 
of neurons in specific paths. The explanation at each neuron is utilized to determine 
the probability of dropout, rather than dropping out neurons uniformly at random as 
in standard dropout. This results in dropping out with higher probability neurons that 
contribute more to decision making at training time, forcing the network to learn alter- 
native paths in order to maintain loss minimization, resulting in a plasticity-like behav- 
ior, a characteristic of human brains. This demonstrates better generalization ability, an 
increased utilization of network neurons, and a higher resilience to network compres- 
sion for image/video recognition. 

Our second training strategy leads to a more explainable AI system for object
classification while suffering no perceptible accuracy degradation [40]. The
strategy enforces a periodic saliency-based feedback to encourage the model to
focus on the image regions that directly correspond to the ground-truth object.
We propose explainability as a means for bridging the visual-semantic gap
between different domains, where model explanations are used as a means of
disentangling domain-specific information from otherwise relevant features. We
demonstrate that this leads to improved generalization to new domains without
hindering performance on the original domain.

Our third strategy is applied at test time and improves model accuracy by zooming in 
on the evidence, and ensuring the model has “the right reasons” for a prediction, being 
defined as reasons that are coherent with those used to make similar correct decisions 
at training time [2,3]. The reason/evidence upon which a deep neural network makes a 
prediction is defined to be the spatial grounding, in the pixel space, for a specific class 
conditional probability in the model output. We use evidence grounding as the signal to 
a module that assesses how much one can trust a Convolutional Neural Network (CNN) 
prediction over another. 

The rest of this chapter is organized as follows. Section 2 presents saliency
approaches that target explaining how deep neural network models associate
input regions to output predictions. Sections 3, 4, and 5 present approaches
that utilize explainability in the form of saliency (Sect. 2) to obtain models
that possess state-of-the-art in-domain and out-of-domain accuracy, have
improved neuron utilization, and are more robust to network compression.
Section 6 concludes the presented line of works.


2 Saliency-Based XAI in Vision 


In this section we present sample white- and black-box methods for
saliency-based explainability for vision models.


2.1 White-Box Models 


We first present sample white-box grounding techniques developed for the purpose of 
explainability of deep vision models. Formulation of white-box techniques assumes 
knowledge of model architectures and parameters. 


Spatial. In a standard spatial CNN, the forward activation of neuron $a_j$ is
computed by $\hat{a}_j = \varphi(\sum_i w_{ij} \hat{a}_i + b_i)$, where
$\hat{a}_i$ is the activation coming from the previous layer, $\varphi$ is a
nonlinear activation function, and $w_{ij}$ and $b_i$ are the weight from
neuron $i$ to neuron $j$ and the added bias at layer $i$, respectively.
Excitation Backprop (EB) was proposed in [37] to identify the task-relevant
neurons in any intermediate layer of a pre-trained CNN network. EB devises a
backpropagation formulation that is able to reconstruct the evidence used by a
deep model to make decisions. It computes the probability of each neuron
recursively using conditional probabilities $P(a_i|a_j)$ in a top-down order
starting from a probability distribution over the output units, as follows:

$$P(a_i) = \sum_{a_j \in \mathcal{P}_i} P(a_i|a_j) P(a_j) \tag{1}$$


where $\mathcal{P}_i$ is the parent node set of $a_i$. EB passes top-down
signals through excitatory connections having non-negative activations,
excluding inhibitory ones from the competition. EB is designed with an
assumption of non-negative activations that are positively correlated with the
detection of specific visual features. Most modern CNNs use ReLU activation
functions, which satisfy this assumption. Therefore, negative weights can be
assumed to not positively contribute to the final prediction. Letting
$\mathcal{C}_j$ denote the child node set of $a_j$, for each
$a_i \in \mathcal{C}_j$ the conditional winning probability $P(a_i|a_j)$ is
defined as

$$P(a_i|a_j) = \begin{cases} Z_j \hat{a}_i w_{ij}, & \text{if } w_{ij} \ge 0, \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$

where $Z_j$ is a normalization factor such that a probability distribution is
maintained, i.e. $\sum_{a_i \in \mathcal{C}_j} P(a_i|a_j) = 1$. Recursively
propagating the top-down signal and preserving
the sum of backpropagated probabilities, it is possible to highlight the salient neurons 
in each layer using Eq. 1, i.e. neurons that mostly contribute to a specific task. This has 
been shown to accurately localize spatial objects in images (corresponding to object 
classes) in a weakly-supervised way. 
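
To make the recursion of Eqs. 1 and 2 concrete, the following is a minimal sketch of one top-down EB step through a single fully connected layer. It is not the authors' implementation: the function name `excitation_backprop_layer`, the list-of-lists weight layout, and the toy numbers are assumptions made for illustration.

```python
def excitation_backprop_layer(child_act, weights, parent_prob):
    """Propagate winning probabilities from parent neurons to child neurons.

    child_act[i]   -- non-negative activation of child neuron i
    weights[i][j]  -- weight from child i to parent j
    parent_prob[j] -- winning probability P(a_j) of parent j
    Returns the list of child probabilities P(a_i).
    """
    n_child, n_parent = len(child_act), len(parent_prob)
    child_prob = [0.0] * n_child
    for j in range(n_parent):
        # Only excitatory (non-negative) weights take part in the competition.
        excite = [child_act[i] * max(weights[i][j], 0.0) for i in range(n_child)]
        total = sum(excite)  # equals 1 / Z_j
        if total == 0.0:
            continue  # parent j receives no excitatory input; nothing to pass on
        for i in range(n_child):
            child_prob[i] += excite[i] / total * parent_prob[j]
    return child_prob


# Toy layer: 3 children, 2 parents (hypothetical numbers).
child_act = [1.0, 2.0, 0.5]
weights = [[0.5, -1.0],
           [0.25, 0.4],
           [-0.3, 0.2]]
parent_prob = [0.7, 0.3]
child_prob = excitation_backprop_layer(child_act, weights, parent_prob)
print(child_prob)
```

Because each parent distributes exactly its own probability mass among its children, the child probabilities again sum to one whenever every parent has at least one excitatory input.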


Spatiotemporal. Spatiotemporal explainability is instrumental for applications like 
action detection and image/video captioning [32]. We extend EB to become spatiotem- 
poral [1]. This work is the first to formulate top-down saliency in deep recurrent models 
for space-time grounding of videos. In this section we explain the details of our spa- 
tiotemporal grounding framework: cEB-R. As illustrated in Fig. 1, we have three main 
modules: RNN Backward, Temporal normalization, and CNN Backward. 

The RNN Backward module implements an excitation backprop formulation for 
RNNs. Recurrent models such as LSTMs are well-suited for top-down temporal 
saliency as they explicitly propagate information over time. The extension of EB for 
Recurrent Networks, EB-R, is not straightforward since EB must be implemented 
through the unrolled time steps of the RNN and since the original RNN formulation 
contains tanh non-linearities which do not satisfy the EB assumption. [6, 10] have con- 
ducted an analysis over variations of the standard RNN formulation, and discovered that 
different non-linearities performed similarly for a variety of tasks. Based on this, we use 
ReLU nonlinearities and corresponding derivatives, instead of tanh. This satisfies the 
EB assumption, and results in similar performance on both tasks. 

Working backwards from the RNN's output layer, we compute the conditional
winning probabilities from the set of output nodes $O$, and the set of dual
output nodes $\bar{O}$:

$$P^t(a_i|a_j) = \begin{cases} Z_j \hat{a}_i^t w_{ij}, & \text{if } w_{ij} \ge 0, \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$

$$\bar{P}^t(a_i|a_j) = \begin{cases} Z_j \hat{a}_i^t \bar{w}_{ij}, & \text{if } \bar{w}_{ij} \ge 0, \\ 0, & \text{otherwise.} \end{cases} \tag{4}$$

$Z_j = 1 / \sum_{i: w_{ij} \ge 0} \hat{a}_i^t w_{ij}$ is a normalization factor
such that the conditional probabilities of the children of $a_j$ (Eqs. 3, 4)
sum to 1; $w_{ij} \in W$, where $W$ is the set of model weights and $w_{ij}$ is
the weight between child neuron $a_i$ and parent neuron $a_j$;
$\bar{w}_{ij} \in \bar{W}$, where $\bar{W}$ is obtained by negating the model
weights at the classification layer only. $\bar{P}^t(a_i|a_j)$ is only needed
for contrastive attention.

Fig. 1. Our proposed framework spatiotemporally highlights/grounds the evidence that an RNN
model used in producing a class label or caption for a given input video. In this example, by 
using our proposed back-propagation method, the evidence for the activity class CliffDiving is 
highlighted in a video that contains CliffDiving and HorseRiding. Our model employs a single 
backward pass to produce saliency maps that highlight the evidence that a given RNN used in 
generating its outputs. 


We compute the neuron winning probabilities starting from the prior distribution 
encoding a given action/caption as follows: 


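$$P^t(a_i) = \sum_{a_j \in \mathcal{P}_i} P^t(a_i|a_j) P^t(a_j) \tag{5}$$

$$\bar{P}^t(a_i) = \sum_{a_j \in \mathcal{P}_i} \bar{P}^t(a_i|a_j) \bar{P}^t(a_j) \tag{6}$$

where $\mathcal{P}_i$ is the set of parent neurons of $a_i$.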

Replacing tanh non-linearities with ReLU non-linearities to extend EB in time
does not suffice for temporal saliency. EB performs normalization at every
layer to maintain a probability distribution. For spatiotemporal localization,
the Temporal Normalization module normalizes signals from the desired $n$-th
time-step of a $T$-frame clip in both time and space (assuming $S$ neurons in
the current layer) before they are further backpropagated into the CNN:

$$P_N^t(a_i) = P^t(a_i) \Big/ \sum_{t=1}^{T} \sum_{i=1}^{S} P^t(a_i). \tag{7}$$

$$\bar{P}_N^t(a_i) = \bar{P}^t(a_i) \Big/ \sum_{t=1}^{T} \sum_{i=1}^{S} \bar{P}^t(a_i). \tag{8}$$

cEB-R computes the difference between the normalized saliency maps obtained by
EB-R starting from $O$, and EB-R starting from $\bar{O}$ using negated weights
of the classification layer. cEB-R is more discriminative as it grounds the
evidence that is unique to a selected class/word and not common to other
classes used at training time. This is conducted as follows:

$$Map^t(a_i) = P_N^t(a_i) - \bar{P}_N^t(a_i). \tag{9}$$
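
The temporal normalization and contrastive subtraction (Eqs. 7 to 9) amount to dividing each set of probabilities by its total over all time steps and neurons, then subtracting. A minimal sketch, assuming the two backward passes have already produced T x S nested lists of non-negative values (the names `normalize` and `contrastive_map` and the toy numbers are hypothetical):

```python
def normalize(maps):
    """Divide every value in a T x S list of saliency values by the grand total."""
    total = sum(sum(frame) for frame in maps)
    return [[value / total for value in frame] for frame in maps]


def contrastive_map(pos, neg):
    """Difference of the normalized maps for the selected class and for its dual."""
    pos_n, neg_n = normalize(pos), normalize(neg)
    return [[p - n for p, n in zip(pf, nf)] for pf, nf in zip(pos_n, neg_n)]


# Toy values: T = 2 time steps, S = 3 neurons.
pos = [[0.1, 0.6, 0.1], [0.1, 0.05, 0.05]]  # backward pass from the class output
neg = [[0.2, 0.2, 0.2], [0.2, 0.1, 0.1]]    # backward pass from the dual output
cmap = contrastive_map(pos, neg)
print(cmap)
```

Entries that are positive after subtraction mark evidence specific to the selected class; evidence shared with the dual pass cancels out.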

For every video frame $f_t$ at time step $t$, we use the backprop of [37] for
all CNN layers in the CNN Backward module:

$$P^t(a_i|a_j) = \begin{cases} Z_j \hat{a}_i^t w_{ij}, & \text{if } w_{ij} \ge 0, \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

$$Map^t(a_i) = \sum_{a_j \in \mathcal{P}_i} P^t(a_i|a_j) \, Map^t(a_j) \tag{11}$$

where $\hat{a}_i^t$ is the activation when frame $f_t$ is passed through the
CNN. $Map^t$ at the desired CNN layer is the cEB-R saliency map for $f_t$.
Computationally, the complexity of cEB-R is on the order of a single backward
pass. Note that for EB-R, $P_N^t(a_i)$ is used instead of $Map^t(a_i)$ in
Eq. 11.

The general framework has been applied to action localization. We ground the evi- 
dence of a specific action using a model trained on this task. The input is a video 
sequence and the action to be localized, and the output is spatiotemporal saliency maps 
for this action in the video. Performing cEB-R results in a sequence of saliency
maps $Map^t$ for $t = 1, ..., T$. These maps can then be used for localizing
the action by finding temporal regions of highest aggregate saliency. This has
also been applied to other
spatiotemporal applications such as image and video captioning. 
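
One simple way to find a temporal region of highest aggregate saliency is a fixed-length sliding window over per-frame saliency sums. The sketch below assumes a known window length and a hypothetical helper name `localize`; the text does not specify how the temporal regions are searched, so this is only one possible reading.

```python
def localize(frame_saliency, window):
    """Return (start, end) of the window of consecutive frames with the
    highest summed saliency; end is exclusive."""
    scores = [sum(frame_saliency[s:s + window])
              for s in range(len(frame_saliency) - window + 1)]
    start = max(range(len(scores)), key=scores.__getitem__)
    return start, start + window


# Per-frame saliency, e.g. each saliency map summed over space (toy numbers).
per_frame = [0.1, 0.0, 0.2, 0.9, 1.1, 0.8, 0.1, 0.0]
print(localize(per_frame, window=3))
```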


2.2 Black-Box Models 


Black-box methods operate under the assumption that no internal information about the 
model is available. Thus we can only observe the final output of the model for each 
input that we provide. In this paradigm, explaining the black-box model
requires querying the model in such a way that the outputs reveal some of its
underlying behaviour. These methods are typically slower than white-box
approaches since information is obtained at the cost of additional queries to
the model.

One way to construct the queries is to run the model on similar versions of the input 
and analyze the differences in the output. For example, to compute how important dif- 
ferent regions of the inputs are, i.e. compute saliency, one can mask out certain parts of 
the image. A significant change in the output indicates that the masked region
is important.
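
The masking idea just described can be sketched as a basic occlusion test: zero out one patch at a time and record how much the model score drops. The model is treated as an opaque callable; `occlusion_saliency`, the patch size, and the toy model are illustrative assumptions, and this is plain occlusion rather than any specific published method.

```python
def occlusion_saliency(model, image, patch=2):
    """Saliency of each pixel = drop in model score when its patch is zeroed."""
    h, w = len(image), len(image[0])
    base = model(image)
    saliency = [[0.0] * w for _ in range(h)]
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            rows = range(top, min(top + patch, h))
            cols = range(left, min(left + patch, w))
            occluded = [row[:] for row in image]  # copy so the input is untouched
            for r in rows:
                for c in cols:
                    occluded[r][c] = 0.0
            drop = base - model(occluded)  # one extra query per patch
            for r in rows:
                for c in cols:
                    saliency[r][c] = drop
    return saliency


def toy_model(img):
    """Stand-in black box whose score depends only on the two top-left pixels."""
    return img[0][0] + img[0][1]


image = [[1.0] * 4 for _ in range(4)]
sal = occlusion_saliency(toy_model, image)
print(sal[0][0], sal[2][2])
```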

Our method RISE [22] builds on this idea. We probe the base model by perturbing
the input image using random masks and record its responses to each of the
masked images. The saliency map $S$ is computed as a weighted sum of the used
masks, where the weights come from the probabilities predicted by the base
model (see Fig. 2):

$$S = \frac{1}{\sum_{M \in \mathcal{M}} M} \sum_{M \in \mathcal{M}} f(I \odot M) \cdot M, \tag{12}$$

where $f$ is the base model, $I$ is the input image and $\mathcal{M}$ is the set
of generated masks. The mask $M$ has large weight $f(I \odot M)$ in the sum only
if the score of the base model is high on the masked image, i.e. the mask
preserves important regions. We generate masks as a uniformly random binary
grid (bilinearly upsampled) to refrain from imposing any priors on the
resulting saliency maps.
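
A minimal sketch of this random-masking scheme follows. It makes several simplifying assumptions that are not from the text: masks are drawn per pixel without the bilinear upsampling step, the sum is normalized per pixel by how often that pixel was left visible (one possible choice of normalizer), and the black-box model is a toy function. The names `rise_saliency` and `toy_model` are hypothetical.

```python
import random


def rise_saliency(model, image, n_masks=2000, keep_prob=0.5, seed=0):
    """Score-weighted average of random binary masks."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    weighted = [[0.0] * w for _ in range(h)]  # sum of score * mask
    visible = [[0.0] * w for _ in range(h)]   # sum of mask (normalizer)
    for _ in range(n_masks):
        mask = [[1.0 if rng.random() < keep_prob else 0.0 for _ in range(w)]
                for _ in range(h)]
        masked = [[image[r][c] * mask[r][c] for c in range(w)] for r in range(h)]
        score = model(masked)  # the only access to the model: one query per mask
        for r in range(h):
            for c in range(w):
                weighted[r][c] += score * mask[r][c]
                visible[r][c] += mask[r][c]
    return [[weighted[r][c] / visible[r][c] if visible[r][c] else 0.0
             for c in range(w)] for r in range(h)]


def toy_model(img):
    """Stand-in black box whose score depends only on pixel (1, 2)."""
    return img[1][2]


image = [[1.0] * 4 for _ in range(4)]
sal = rise_saliency(toy_model, image)
best = max(((r, c) for r in range(4) for c in range(4)), key=lambda rc: sal[rc[0]][rc[1]])
print(best)
```

In this toy setting the pixel the model actually depends on receives the highest saliency, since masks that keep it visible are the only ones with a non-zero score.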

Fig. 2. RISE overview


RISE can be applied to explain models that predict a distribution over labels
given an image, such as classification and captioning models. Classification
saliency methods fail when directly applied to object detection models. To
generate such saliency maps for object detectors we propose the D-RISE method
[23]. It accounts for the differences in an object detection model's structure
and output format. To measure the effect of the masks on the model output, we
propose a similarity metric between two detection proposals $d_t$ and $d_j$:

$$s(d_t, d_j) = s_L(d_t, d_j) \cdot s_P(d_t, d_j) \cdot s_O(d_t, d_j). \tag{13}$$

This metric computes similarity values for the three components of the
detection proposals: localization (bounding box $L$), classification (class
probabilities $P$), and objectness score ($O$):

$$s_L(d_t, d_j) = \mathrm{IoU}(L_t, L_j), \tag{14}$$

$$s_P(d_t, d_j) = \frac{P_t \cdot P_j}{\|P_t\| \, \|P_j\|}. \tag{15}$$


Using the masking technique and the similarity metric, D-RISE can compute
saliency maps for object detectors in a similar querying manner. We use D-RISE to
gain insights into the use of context by the detector. We demonstrate how to use saliency 
to better understand the use of correlations in the data by the model, e.g. ski poles are 
used when detecting the ski class. We also demonstrate the utility of saliency maps for 
detecting accidental or adversarial biases in the data. 
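
As a rough illustration of the similarity metric, the sketch below multiplies an IoU term, a cosine similarity between class-probability vectors, and an objectness term. The dictionary layout of a detection, the function names, and in particular the choice of taking the objectness term from the compared proposal are assumptions made for this example.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cosine(p, q):
    """Cosine similarity between two class-probability vectors."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    denom = np.linalg.norm(p) * np.linalg.norm(q)
    return float(p @ q / denom) if denom > 0 else 0.0

def detection_similarity(target, proposal):
    """Product of localization, classification and objectness terms.

    Each detection is a dict with keys 'box', 'probs', 'objectness'
    (a hypothetical layout chosen for this sketch).
    """
    s_l = iou(target["box"], proposal["box"])
    s_p = cosine(target["probs"], proposal["probs"])
    s_o = proposal["objectness"]  # assumption: objectness of the compared proposal
    return s_l * s_p * s_o

target = {"box": (0, 0, 2, 2), "probs": [0.9, 0.1], "objectness": 1.0}
shifted = {"box": (1, 0, 3, 2), "probs": [0.9, 0.1], "objectness": 1.0}
far = {"box": (5, 5, 6, 6), "probs": [0.9, 0.1], "objectness": 1.0}
```

Since the terms are multiplied, a proposal scores highly only if it matches the target in location and class at the same time; a disjoint box yields zero regardless of its class scores.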


3 XAI for Improved Models: Excitation Dropout 


Dropout mitigates overfitting on training data, allowing for better generalization on
unseen test data. In this work, we aim to determine how the dropped neurons should be
selected, answering the question: which neurons to drop out?

Our approach [39] is inspired by brain plasticity [8,17,18,29]. We deliberately, and
temporarily, paralyze/injure neurons to enforce learning alternative paths in a deep
network. At training time, neurons that are more relevant to the correct prediction,
i.e. neurons having a high saliency, are given a higher dropout probability. The
relevance of a neuron for making a certain prediction is quantified using Excitation
Backprop [37]. Excitation Backprop conveniently yields a probability distribution at each
layer that reflects neuron saliency, or neuron contribution to the prediction being made.
This is utilized in the training pipeline of our approach, named Excitation Dropout,
which is summarized in Fig. 3.

Fig. 3. Training pipeline of Excitation Dropout. Step 1: A minibatch goes through the standard
forward pass. Step 2: Backward EB is performed until the specified dropout layer; this gives a
neuron saliency map at the dropout layer in the form of a probability distribution. Step 3: The
probability distribution is used to generate a binary mask for each image of the batch based on
a Bernoulli distribution determining whether each neuron will be dropped out or not. Step 4: A
forward pass is performed from the specified dropout layer to the end of the network, zeroing
the activations of the dropped out neurons. Step 5: The standard backward pass is performed to
update model weights.


Method. In the standard formulation of dropout [9,31], the suppression of a neuron
in a given layer is modeled by a Bernoulli random variable with parameter p, which is
defined as the probability of retaining a neuron, 0 < p < 1. Given a specific layer where
dropout is applied, during the training phase, each neuron is turned off with probability
1 − p.

We argue for a different approach that is guided in the way it selects neurons to 
be dropped. In a training iteration, certain paths have high excitation contributing to 
the resulting classification, while other regions of the network have low responses. We 
encourage the learning of alternative paths (plasticity) through the temporary damaging 
of the currently highly excited path. We re-define the probability of retaining a neuron 
as a function of its contribution in the currently highly excited path:


$$p = 1 - \frac{(1 - P)\,(N - 1)\,p_{EB}}{\big((1 - P)\,N - 1\big)\,p_{EB} + P}, \tag{17}$$


where $p_{EB}$ is the probability backpropagated through the EB formulation (Eq. 1) in
layer l, P is the base probability of retaining a neuron when all neurons are equally
contributing to the prediction, and N is the number of neurons in a fully-connected layer
l or the number of filters in a convolutional layer l. The retaining probability defined in
Eq. 17 drops the neurons that contribute the most to the recognition of a specific class
with higher probability. By dropping out highly relevant neurons, we retain less relevant
ones and thus encourage them to awaken.
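
The behaviour of such a saliency-dependent retaining probability can be checked numerically. The sketch below assumes the functional form p = 1 − (1 − P)(N − 1) p_EB / (((1 − P)N − 1) p_EB + P), which matches the properties stated in the text (p = P when all N neurons contribute equally, and a lower retaining probability for more salient neurons); the function names and the example values are hypothetical.

```python
import numpy as np

def retain_probability(p_eb, base_p):
    """Per-neuron retaining probability from an EB saliency distribution.

    p_eb   : saliency distribution over the N neurons of a layer (sums to 1)
    base_p : base retaining probability P used when saliency is uniform
    """
    p_eb = np.asarray(p_eb, dtype=np.float64)
    n = p_eb.size
    num = (1.0 - base_p) * (n - 1) * p_eb
    den = ((1.0 - base_p) * n - 1.0) * p_eb + base_p
    return 1.0 - num / den

def excitation_dropout_mask(p_eb, base_p, rng=None):
    """Sample a binary keep-mask: 1 keeps a neuron, 0 drops it."""
    rng = np.random.default_rng(rng)
    keep = retain_probability(p_eb, base_p)
    return (rng.random(keep.shape) < keep).astype(np.float64)

uniform = retain_probability([0.25, 0.25, 0.25, 0.25], base_p=0.5)
peaked = retain_probability([1.0, 0.0, 0.0, 0.0], base_p=0.5)
skewed = retain_probability([0.6, 0.2, 0.1, 0.1], base_p=0.5)
mask = excitation_dropout_mask([0.6, 0.2, 0.1, 0.1], base_p=0.5, rng=0)
```

With a uniform saliency distribution the rule reduces to standard dropout with retaining probability P, while a neuron that carries all of the saliency is always dropped and neurons with zero saliency are always kept.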


Results. We evaluate the effectiveness of Excitation Dropout on popular network
architectures that employ dropout layers including AlexNet [14], VGG16 [28], VGG19 [28],
and CNN-2 [19]. We perform dropout in the first fully-connected layer of the networks
and find that it results in a 1%-5% accuracy improvement in comparison to Standard
Dropout and other dropout variants proposed in the literature, including Adaptive,
Information, and Curriculum Dropout. These results have been validated on image and
video datasets including UCF101 [30], Cifar10 [13], Cifar100 [13], and Caltech256 [7].

Excitation Dropout shows a higher number of active neurons, a higher entropy over
activations, and a probability distribution $p_{EB}$ that is more spread out (higher entropy
over $p_{EB}$) among the neurons of the layer, leading to a lower peak probability of $p_{EB}$
and therefore less specialized neurons. These trends are observed consistently over all
training iterations for the examined image and video recognition datasets. Excitation
Dropout also makes networks more robust to network compression for all examined
datasets: the decline of the GT probability is much less steep as more neurons are
pruned. Explainability has also recently been used to prune networks for transfer
learning from large corpora to more specialized tasks [35].


4 XAI for Improved Models: Domain Generalization 


While Sect. 3 focuses on dropping neurons ‘relevant’ to a prediction as a means of net- 
work regularization within a particular domain, we now propose using such relevance 
to focus on domain agnostic features that can aid domain generalization. 

We develop a training strategy [40] for deep neural network models that increases 
explainability, suffers no perceptible accuracy degradation on the training domain, and 
improves performance on unseen domains. 

We posit that the design of algorithms that better mimic the way humans reason, 
or “explain”, can help mitigate domain bias. Our approach utilizes explainability as a 
means for bridging the visual-semantic gap between different domains as presented in 
Fig. 4. Specifically, our training strategy is guided by model explanations and available 
human-labeled explanations, mimicking interactive human feedback [26]. Explanations 
are defined as regions of visual evidence upon which a network makes a decision. This 
is represented in the form of a saliency map conveying how much each pixel contributed 
to the network’s decision. 

Our training strategy periodically guides the forward activations of spatial layer(s)
of a Convolutional Neural Network (CNN) trained for object classification. The
activations are guided to focus on regions in the image that directly correspond to the
ground-truth (GT) class label, as opposed to context that is more likely to be domain
dependent. The proposed strategy aims to reinforce explanations that are non-domain
specific, and suppress explanations that are domain specific. Classification models are
compact and fast in comparison to more complex semantic segmentation models. This
allows the compact classification model to possess some properties of a segmentation
model without increasing model complexity or test-time overhead.

Fig. 4. In this figure we demonstrate how explainability (XAI) can be used to achieve domain
generalization from a single source. Training a deep neural network model to enforce
explainability, e.g. focusing on the skateboard region (red is most salient, and blue is least
salient) for the ground-truth class skateboard in the central training image, enables improved
generalization to other domains where the background is not necessarily class-informative.
(Color figure online)


Method. We enforce focusing on objects in an image by scaling the forward activations
of a particular spatial layer l in the network at certain epochs. We generate a
multiplicative binary mask for guiding the focus of the network in the layer in which we
are enforcing XAI. For an explainable image $x^i$, the binary mask is a binarization of
the achieved saliency map, i.e. $\mathrm{mask}^i_{j,k} = \mathbb{1}(s^i_{j,k} > 0)$
$\forall j\, \forall k$, $j = 1,\ldots,W$ and $k = 1,\ldots,H$, where W and H are the spatial
dimensions of a layer’s output neuron activations; the mask is active at locations of
non-zero saliency. This reinforces the activations corresponding to the active saliency
regions that have been classified as being explainable. For images that need an improved
explanation, the binary mask is assigned to be the GT spatial annotation, i.e.
$\mathrm{mask}^i_{j,k} = g^i_{j,k}$ $\forall j\, \forall k$, $j = 1,\ldots,W$ and
$k = 1,\ldots,H$; the mask is active at GT locations. This increases the frequency at
which the network reinforces activations at locations that are likely to be non-domain
specific and suppresses activations at locations that are likely to be domain specific.
We then perform element-wise multiplication of our computed mask with the forward
activations of layer l, i.e. $a^{l,i}_{j,k} = \mathrm{mask}^i_{j,k} \cdot a^{l,i}_{j,k}$
$\forall j\, \forall k$, $j = 1,\ldots,W$ and $k = 1,\ldots,H$.
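
The mask construction above can be sketched as follows. The excerpt does not spell out how an image is classified as explainable, so this example assumes a pointing-game style criterion (the saliency peak falls inside the GT annotation); the function names and the channel-first activation layout are likewise assumptions.

```python
import numpy as np

def peak_inside_annotation(saliency, gt_annotation):
    """Assumed criterion: the saliency peak lies inside the GT annotation."""
    idx = np.unravel_index(np.argmax(saliency), saliency.shape)
    return bool(gt_annotation[idx] > 0)

def guidance_mask(saliency, gt_annotation):
    """Binary mask of shape (H, W) used to scale the layer activations."""
    if peak_inside_annotation(saliency, gt_annotation):
        # Explainable image: binarize the model's own saliency map.
        return (saliency > 0).astype(np.float64)
    # Otherwise fall back to the GT spatial annotation.
    return (gt_annotation > 0).astype(np.float64)

def guide_activations(activations, mask):
    """Element-wise product; the (H, W) mask is broadcast over channels."""
    return activations * mask[None, :, :]

gt = np.array([[0.0, 0.0], [1.0, 1.0]])
sal_good = np.array([[0.0, 0.2], [0.9, 0.0]])   # peak inside the annotation
sal_bad = np.array([[0.9, 0.2], [0.0, 0.0]])    # peak on the background
acts = np.ones((3, 2, 2))                        # (channels, H, W)

mask_good = guidance_mask(sal_good, gt)
mask_bad = guidance_mask(sal_bad, gt)
guided = guide_activations(acts, mask_bad)
```

In a training loop, such a mask would be resized to the spatial resolution of layer l and applied only at the selected epochs; both details are left out here.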


Results. The identification of evidence within a visual input using top-down neural
attention formulations [27] can be a powerful tool for domain analysis. We demonstrate
that more explainable deep classification models can be trained without hindering
their performance.

We train ResNet architectures for the single-label classification task on the popular
MSCOCO [15] and PASCAL VOC [4] datasets. The XAI model resulted in a 25% increase in
the number of correctly classified images that also yield better
localization/explainability according to the popular pointing game metric. The XAI model
has learnt to rely less on context information, without hurting performance.

Thus far, evaluation assumed that a saliency map whose peak overlaps with the GT 
spatial annotation of the object is a better explanation. We then conduct a human study 
to confirm our intuitive quantification of an “explainable” model. The study asks users 
what they think is a better explanation for the presence of an object. XAI evidence won 
for 67% of the whole image population and 80% of the images with a winner choice. 

Finally, we demonstrate how the explainable model better generalizes from real 
images of MSCOCO/PASCAL VOC to six unseen target domains from the Domain- 
Net [20] and Syn2Real [21] datasets (clipart, quickdraw, infograph, painting, sketch, 
and graphics). 


5 XAI for Improved Models: Guided Zoom 


In state-of-the-art deep single-label classification models, the top-k (k = 2,3,4,...) 
accuracy is usually significantly higher than the top-1 accuracy. This is more evident 
in fine-grained datasets, where differences between classes are quite subtle. Exploiting
the information provided in the top-k predicted classes can boost the final prediction
of a model. We propose Guided Zoom [3], a novel way in which explainability could
be used to improve model performance. We do so by making sure the model has “the 
right reasons” for a prediction. The reason/evidence upon which a deep neural network 
makes a prediction is defined to be the grounding, in the pixel space, for a specific 
class conditional probability in the model output. Guided Zoom examines how reason- 
able the evidence used to make each of the top-k predictions is. In contrast to work 
that implements reasonableness in the loss function e.g. [24,25], test time evidence is 
deemed reasonable in Guided Zoom if it is coherent with evidence used to make similar 
correct decisions at training time. This leads to better informed predictions. 


Method. We now describe how Guided Zoom utilizes multiple pieces of discriminative
evidence, does not require part annotations, and implicitly enforces part correlations.
We do so by explaining the main modules depicted in Fig. 5.

Conventional CNNs trained for image classification output class conditional prob- 
abilities upon which predictions are made. The class conditional probabilities are the 
result of some corresponding evidence in the input image. From correctly classified 
training examples, we generate a reference pool P of (evidence, prediction) pairs 
over which the Evidence CNN will be trained for the same classification task. We 
recover/ground such evidence using several grounding techniques [1,22,27]. We extract 
the image patch corresponding to the peak saliency region. This patch highlights the 
most discriminative evidence. However, the next most discriminative patches may also 
be good additional evidence for differentiating fine-grained categories. 

Also, grounding techniques only highlight part(s) of an object. However, a more
inclusive segmentation map can be extracted from the already trained model at test
time using an iterative adversarial erasing of patches [33]. We augment our reference
pool with patches resulting from performing iterative adversarial erasing of the most
discriminative evidence from an image. We notice that adversarial erasing results in
implicit part localization from most to least discriminative parts. All patches extracted
from this process inherit the ground-truth label of the original image. By labeling
different parts with the same image ground-truth label, we are implicitly forcing
part-label correlations in Evidence CNN.

Fig. 5. Pipeline of Guided Zoom. A conventional CNN outputs class conditional probabilities for
an input image. Salient patches could reveal that evidence is weak. We refine the class prediction
of the conventional CNN by introducing two modules: 1) Evidence CNN determines the
consistency between the evidence of a test image prediction and that of correctly classified
training examples of the same class. 2) Decision Refinement uses the output of Evidence CNN to
refine the prediction of the conventional CNN.

Including such additional evidence in our reference pool gives a richer description 
of the examined classes compared to models that recursively zoom into one location 
while ignoring other discriminative cues [5]. We note that we add an evidence patch to 
the reference pool only if the removal of the previous salient patch does not affect the 
correct classification of the sample image. Erasing is performed by adding a black-filled 
square on the previous most salient evidence to encourage a highlight of the next salient 
evidence. We then train a CNN model, Evidence CNN, on the generated evidence pool. 

At test time, we analyze whether the evidence upon which a prediction is made is 
reasonable. We do so by examining the consistency of a test (evidence, prediction) with 
our reference pool that is used to train Evidence CNN. We exploit the visual evidence 
used for each of the top-k predictions for Decision Refinement. The refined prediction 
will be inclined toward each of the top-k classes by an amount proportional to how 
coherent its evidence is with the reference pool. For example, if the (evidence, predic- 
tion) of the second-top predicted class is more coherent with the reference pool of this 
class, then the refined prediction will be more inclined toward the second-top class. 

Assuming test image $s^j$, where $j \in 1,\ldots,m$ and m is the number of testing
examples, $s^j$ is passed through the conventional CNN resulting in $v^{j,0}$, a vector of
class conditional probabilities having some top-k classes $c_1,\ldots,c_k$ to be considered
for the prediction refinement. We obtain the evidence for each of the top-k predicted
classes $e_0^{j,c_1},\ldots,e_0^{j,c_k}$, and pass each one through the Evidence CNN to get
the output class conditional probability vectors $v_0^{j,c_1},\ldots,v_0^{j,c_k}$. We then
perform adversarial erasing to get the next most salient evidence
$e_l^{j,c_1},\ldots,e_l^{j,c_k}$ and their corresponding class conditional probability
vectors $v_l^{j,c_1},\ldots,v_l^{j,c_k}$, for $l \in 1,\ldots,L$. Finally, we compute a
weighted combination of all class conditional probability vectors proportional to their
saliency (a lower l has more discriminative evidence and is therefore assigned a higher
weight $w_l$). The estimated, refined class $c^j_{ref}$ is determined as the class having
the maximum aggregate prediction in the weighted combination.
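
One possible reading of the Decision Refinement step is sketched below. The text does not fix the exact aggregation, so this example assumes that the score of each top-k class is the conventional CNN probability plus the weighted Evidence CNN probabilities that its own evidence patches receive for that class; the function name, array layout, and weights are hypothetical.

```python
import numpy as np

def refine_prediction(conv_probs, evidence_probs, weights, conv_weight=1.0):
    """Pick the top-k class with the largest weighted aggregate prediction.

    conv_probs     : (C,) class probabilities of the conventional CNN
    evidence_probs : (L + 1, k, C) Evidence CNN outputs; entry [l, i] belongs
                     to the l-th evidence patch of the i-th top-k class
    weights        : (L + 1,) weights, larger for more salient (lower l) evidence
    """
    conv_probs = np.asarray(conv_probs, dtype=np.float64)
    evidence_probs = np.asarray(evidence_probs, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    k = evidence_probs.shape[1]
    top_classes = np.argsort(conv_probs)[::-1][:k]     # top-k class indices
    scores = conv_weight * conv_probs[top_classes]
    for i, c in enumerate(top_classes):
        # Support that the evidence for class c lends to class c itself.
        scores[i] += np.sum(weights * evidence_probs[:, i, c])
    return int(top_classes[np.argmax(scores)]), scores

conv_probs = [0.5, 0.4, 0.1]                 # top-1 is class 0
evidence_probs = [
    [[0.3, 0.6, 0.1], [0.1, 0.8, 0.1]],      # l = 0: most salient patches
    [[0.4, 0.5, 0.1], [0.2, 0.7, 0.1]],      # l = 1: after one erasing step
]
refined, scores = refine_prediction(conv_probs, evidence_probs, [1.0, 0.5])
```

In this toy example the conventional CNN prefers class 0, but the evidence patches for class 1 are far more consistent with class 1, so the refined prediction switches to the second-top class.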


Results. We show that Guided Zoom improves a model’s classification accuracy on four
fine-grained classification datasets covering various bird species, dog species, aircraft
models, and car models: CUB-200-2011 Birds [34], Stanford Dogs [11], FGVC-Aircraft [16],
and Stanford Cars [12].

Guided Zoom is a generic framework that can be directly applied to any deep
convolutional model for decision refinement within the top-k predictions. Guided Zoom
demonstrates that multi-zooming is more beneficial than a single recursive zoom [5].
We also demonstrate that Guided Zoom further improves the performance of existing
multi-zoom approaches [38]. Choosing random patches to be used with the original
images, as opposed to Guided Zoom patches, yields results comparable to using the
original images on their own. Therefore, Guided Zoom presents performance gains that
are complementary to data augmentation.


6 Conclusion 


This chapter presents sample white- and black-box approaches to providing visual 
grounding as a form of explainable AI. It also presents a human judgement verifica- 
tion that such visual explainability techniques mostly agree with evidence humans use 
for the presence of visual cues. This chapter then demonstrates three strategies for how
this preliminary form of explainable AI (also widely known as saliency maps) can be
integrated into automated algorithms that do not require human feedback, in order to
improve fine-grained accuracy, in-domain and out-of-domain generalization, network
utilization, and robustness to network compression.


Abstract. The remarkable success of deep neural networks (DNNs) in
various applications is accompanied by a significant increase in network 
parameters and arithmetic operations. Such increases in memory and 
computational demands make deep learning prohibitive for resource- 
constrained hardware platforms such as mobile devices. Recent efforts 
aim to reduce these overheads, while preserving model performance as 
much as possible, and include parameter reduction techniques, parame- 
ter quantization, and lossless compression techniques. 

In this chapter, we develop and describe a novel quantization
paradigm for DNNs. Our method leverages concepts of explainable AI
(XAI) and of information theory: instead of assigning weight
values based only on their distances to the quantization clusters, the
assignment function additionally considers weight relevances obtained from
Layer-wise Relevance Propagation (LRP) and the information content 
of the clusters (entropy optimization). The ultimate goal is to preserve 
the most relevant weights in quantization clusters of highest information 
content. 

Experimental results show that this novel Entropy-Constrained
and XAI-adjusted Quantization (ECQ*) method generates ultra
low-precision (2-5 bit) and simultaneously sparse neural networks while
maintaining or even improving model performance. Due to the reduced
parameter precision and the high number of zero-elements, the rendered
networks are highly compressible in terms of file size, up to 103x compared
to the full-precision unquantized DNN model. Our approach was
evaluated on different types of models and datasets (including Google Speech
Commands, CIFAR-10 and Pascal VOC) and compared with previous
work.


1 Introduction


Solving increasingly complex real-world problems continuously contributes to the
success of deep neural networks (DNNs) [37,38]. DNNs have long been established
in numerous machine learning tasks and have been significantly improved over the
past decade. This is often achieved by over-parameterizing models, i.e., their
performance is attributed to their growing topology, with more layers and more
parameters per layer [18,41]. Processing a very large number of parameters comes
at the expense of memory and computational efficiency. The sheer size of
state-of-the-art models makes it difficult to execute them on resource-constrained
hardware platforms. In addition, an increasing number of parameters implies higher
energy consumption and increasing run times.

Such immense storage and energy requirements, however, contradict the
demand for efficient deep learning applications for an increasing number of
hardware-constrained devices, e.g., mobile phones, wearable devices, Internet
of Things, autonomous vehicles or robots. Specific restrictions of such devices
include limited energy, memory, and computational budget. Beyond these, typical
applications on such devices, e.g., healthcare monitoring, speech recognition,
or autonomous driving, require low latency and/or data privacy. These latter
requirements are addressed by running the aforementioned applications directly
on the respective devices (also known as “edge computing”) instead of
transferring data to third-party cloud providers prior to processing.

In order to tailor deep learning to resource-constrained hardware, a large
research community has emerged in recent years [10,45]. By now, there exists a
vast number of tools to reduce the number of operations and the model size, as well
as tools to reduce the precision of operands and operations (bit width reduction,
going from floating point to fixed point). Topics range from neural architecture
search (NAS), knowledge distillation, pruning/sparsification, quantization and
lossless compression to hardware design.

Above all, quantization and sparsification are very promising and show great
improvements in terms of neural network efficiency optimization [21,43].
Sparsification sets less important neurons or weights to zero, and quantization
reduces parameters’ bit widths from the default 32 bit float to, e.g., 4 bit integer.
These two techniques enable higher computational throughput, memory reduction and
skipping of arithmetic operations for zero-valued elements, to name just a few
benefits. However, combining both high sparsity and low precision is challenging,
especially when relying only on the weight magnitudes as a criterion for the
assignment of weights to quantization clusters.

In this work, we propose a novel neural network quantization scheme to 
render low-bit and sparse DNNs. More precisely, our contributions can be sum- 
marized as follows: 


1. Extending the state-of-the-art concept of entropy-constrained quantization
(ECQ) to utilize concepts of XAI in the clustering assignment function.

2. Using relevances obtained from Layer-wise Relevance Propagation (LRP) at
the granularity of per-weight decisions to correct the magnitude-based weight
assignment.

3. Obtaining state-of-the-art or better results in terms of the trade-off between
efficiency and performance compared to previous work.


The chapter is organized as follows: First, an overview of related work is
given. Second, in Sect. 3, basic concepts of neural network quantization are
explained, followed by entropy-constrained quantization. Section 4 describes the
ECQ extension towards ECQ* as an explainability-driven approach. Here, LRP
is introduced and the per-weight relevance derivation for the assignment function
is presented. Next, the ECQ* algorithm is described in detail. Section 5 presents
the experimental setup and the obtained results, followed by the final conclusion in
Sect. 6.


2 Related Work 


A large body of literature exists that has focused on improving DNN model 
efficiency. Quantization is an approach that has shown great success [14]. While 
most research focuses on reducing the bit width for inference, [52] and others 
focus on quantizing weights, gradients and activations to also accelerate back- 
ward pass and training. Quantized models often require fine-tuning or re-training 
to adjust model parameters and compensate for quantization-induced accuracy 
degradation. This is especially true for precisions <8 bit (cf. Fig. 1 in Sect. 3). 
Trained quantization is often referred to as “quantization-aware training”, for 
which additional trainable parameters may be introduced (e.g., scaling parame- 
ters [6] or directly trained quantization levels (centroids) [53]). A precision reduc- 
tion to even 1 bit was introduced by BinaryConnect [8]. However, this kind of 
quantization usually results in severe accuracy drops. As an extension, ternary 
networks allow weights to be zero, i.e., constraining them to 0 in addition to 
w_ and w+, which yields results that outperform the binary counterparts [28]. 
In DNN quantization, most clustering approaches are based on distance mea- 
surements between the unquantized weight distribution and the corresponding 
centroids. The works in [7] and [32] were pioneering in using Hessian-weighted
and entropy-constrained clustering techniques. More recently, the work of [34] uses
concepts from XAI for DNN quantization. It uses DeepLIFT importance measures,
which are restricted to the granularity of convolutional channels, whereas
our proposed ECQ* computes LRP relevances per weight.

Another method for reducing the memory footprint and computational cost
of DNNs is sparsification. In sparsification techniques, weights with
small saliency (i.e., weights which minimally affect the model’s loss function) are
set to zero, resulting in a sparser computational graph and more compressible
matrices. Thus, it can be interpreted as a special form of quantization with
only one quantization cluster, with centroid value 0, to which part of the
parameter elements are assigned. This sparsification can be carried out as
unstructured sparsification [17], where any weight in the matrix with small saliency is
set to zero, independently of its position. Alternatively, a structured sparsification
is applied, where an entire regular subset of parameters is set to zero, e.g.,
entire convolutional filters, matrix rows or columns [19]. “Pruning” is conceptually
related to sparsification but actually removes the respective weights rather
than setting them to zero. This has the effect of changing the input
and output shapes of layers and weight matrices¹. Most pruning/sparsification
approaches are magnitude-based, i.e., weight saliency is approximated by the
weight values, which is straightforward. However, since the early 1990s, methods
that use, e.g., second-order Taylor information for weight saliency [27] have been
used alongside other criteria ranging from random pruning to correlation and
similarity measures (for the interested reader we recommend [21]). In [51], LRP
relevances were first used for structured pruning.

Generating efficient neural network representations can also be a result of 
combining multiple techniques. In Deep Compression [16], a three-stage model 
compression pipeline is described. First, redundant connections are pruned iter- 
atively. Next, the remaining weights are quantized. Finally, entropy coding is 
applied to further compress the weight matrices in a lossless manner. This
three-stage model is also used in the new international ISO/IEC standard on Neural
Network compression and Representation (NNR) [24], where efficient data reduction,
quantization and entropy coding methods are combined. For coding, the
highly efficient universal entropy coder DeepCABAC [47] is used, which yields
compression gains of up to 63x. Although this method achieves high
compression gains, the compressed representation of the DNN weights requires
decoding prior to performing inference. In contrast, compressed matrix formats
like Compressed Sparse Row (CSR) derive a representation that enables
inference directly in the compressed format [49].

Orthogonal to the previously described approaches is the research area of 
Neural Architecture Search (NAS) [12]. Both manual [36] and automated [44] 
search strategies have played an important role in optimizing DNN architectures 
in terms of latency, memory footprint, energy consumption, etc. Microstructural 
changes include, e.g., the replacement of standard convolutional layers by more 
efficient types like depth-wise or point-wise convolutions, layer decomposition or 
factorization, or kernel size reduction. The macro architecture specifies the type 
of modules (e.g., inverted residual), their number and connections. 

Knowledge distillation (KD) [20] is another active branch of research that 
aims at generating efficient DNNs. The KD paradigm leverages a large teacher 
model that is used to train a smaller (more efficient) student model. Instead 
of using the “hard” class labels to train the student, the key idea of model 
distillation is to deploy the teacher’s class probabilities, as they can contain 
more information about the input. 


¹ In practice, pruning is often simulated by masking, instead of actually restructuring
the model’s architecture.

Fig. 1. Difference in sensitivity between activation and weight quantization of the
EfficientNet-B0 model pre-trained on ImageNet. Uniform quantization without
re-training was used as the quantization scheme. Activations are more sensitive to
quantization, since model performance drops significantly faster. Going below 8 bit is
challenging and often requires (quantization-aware) re-training of the model to
compensate for the quantization error. Data originates from [50].


3 Neural Network Quantization 


For neural network computing, the default precision used on general hardware 
like GPUs or CPUs is 32 bit floating-point (“single-precision”), which causes 
high computational costs, power consumption, arithmetic operation latency and 
memory requirements [43]. Here, quantization techniques can reduce the
number of bits required to represent weight parameters and/or activations of
the full-precision neural network, as they map the respective data values to
a finite set of discrete quantization levels (clusters). Providing n such clusters
allows each data point to be represented in only $\log_2 n$ bits. However, continuously
reducing the number of clusters generally leads to an increasingly large error
and degraded performance (see the EfficientNet-B0² example in Fig. 1).

This trade-off is a well-known problem in information theory and is addressed
by rate-distortion optimization, a concept in lossy data compression. It aims to
determine the minimal number of bits per data symbol (bitrate) at which the
reconstruction of the compressed data does not exceed a certain level of distortion.
Applying this to the domain of neural network quantization, the objective
is to minimize the bitrate of the weight parameters while keeping model degradation
caused by quantization below a certain threshold, i.e., the predictive performance
of the model should not be affected by reduced parameter precisions. In
contrast to multimedia compression approaches, e.g., for audio or video coding,
the compression of DNNs has unique challenges and opportunities. First,
the neural network parameters to be compressed are not perceived directly by
a user, as is the case, e.g., for video data. Therefore, the coding or compression
error or distortion cannot be directly used as a performance measure. Instead, such
an accuracy measurement needs to be derived from a subsequent inference step. Second,
current neural networks are highly over-parameterized [11], which allows for high
errors/differences between the full-precision and the quantized parameters (while
still maintaining model performance). Third, the various layer types and the
location of a layer within the DNN have different impacts on the loss function, and
thus different sensitivities to quantization.

² https://github.com/lukemelas/EfficientNet-PyTorch, Apache License, Version 2.0 -
Copyright (c) 2019 Luke Melas-Kyriazi.

Fig. 2. Quantizing a neural network’s layer weights (binned weight distribution shown
as green bars) to 7 discrete cluster centers (centroids). The centroids (black bars) were
generated by k-means clustering and the height of each bar represents the number of
layer weights which are assigned to the respective centroid.

Quantization can be further classified into uniform and non-uniform quan- 
tization. The most intuitive way to initialize centroids is by arranging them 
equidistantly over the range of parameter values (uniform). Other quantization 
schemes make use of non-uniform mapping functions, e.g., k-means clustering, 
which is determined by the distribution of weight values (see Fig. 2). As non- 
uniform quantization captures the underlying distribution of parameter values 
better, it may achieve less distortion compared to equidistantly arranged cen- 
troids. However, non-uniform schemes are typically more difficult to deploy on 
hardware, e.g., they require a codebook (look-up table), whereas uniform quan- 
tization can be implemented using a single scaling factor (step size) which allows 
a very efficient hardware implementation with fixed-point integer logic. 
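
The difference between the two schemes can be illustrated with a minimal sketch in
plain Python. It is not taken from the chapter: the weight values, the 7-level grid and
the codebook entries are hypothetical, and the symmetric max-abs scaling is only one
possible way to choose the step size.

```python
import math

def uniform_quantize(weights, num_levels):
    """Quantize to a symmetric, equidistant integer grid defined by one step size."""
    half = (num_levels - 1) // 2                   # 7 levels -> integers -3..3
    step = max(abs(w) for w in weights) / half     # the single scaling factor
    ints = [max(-half, min(half, round(w / step))) for w in weights]
    return ints, step

def codebook_quantize(weights, codebook):
    """Non-uniform case: store, per weight, the index of the nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda c: (w - codebook[c]) ** 2)
            for w in weights]

weights = [-0.9, -0.31, -0.05, 0.02, 0.08, 0.4, 0.7]   # hypothetical layer weights

ints, step = uniform_quantize(weights, num_levels=7)
reconstructed = [i * step for i in ints]           # dequantization is one multiply

codebook = [-0.8, -0.3, 0.0, 0.05, 0.5]            # hypothetical centroids
indices = codebook_quantize(weights, codebook)
lookup = [codebook[c] for c in indices]            # dequantization needs the table

bits_uniform = math.ceil(math.log2(7))             # bits needed to index 7 levels
```

In the uniform case only the integers and one step size have to be stored, whereas the
non-uniform case additionally requires the codebook for every reconstruction.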


3.1 Entropy-Constrained Quantization 


As discussed in [49], and experimentally shown in [50], lowering the entropy of 
DNN weights provides benefits in terms of memory as well as computational 
complexity. The Entropy-Constrained Quantization (ECQ) algorithm is a clus- 
tering algorithm that also takes the entropy of the weight distributions into 
account. More precisely, the first-order entropy $H = -\sum_c P_c \log_2 P_c$ is used,
where $P_c$ is the ratio of the number of parameter elements in the $c$-th cluster to
the number of all parameter elements (i.e., the source distribution). To recall,
the entropy H is the theoretical limit of the average number of bits required to 
represent any element of the distribution [39]. 
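
As a small worked example of this definition, the following sketch computes the
first-order entropy of two hypothetical cluster assignments with 16 weights each. The
assignment lists are made up for illustration; only the formula is taken from the text.

```python
import math
from collections import Counter

def first_order_entropy(assignments):
    """H = -sum_c P_c * log2(P_c), with P_c the fraction of elements in cluster c."""
    counts = Counter(assignments)
    total = len(assignments)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

balanced = [0, 1, 2, 3] * 4          # four equally populated clusters
sparse = [0] * 13 + [1, 2, 3]        # most weights sit in cluster 0

h_balanced = first_order_entropy(balanced)   # 2 bits per weight
h_sparse = first_order_entropy(sparse)       # below 1 bit per weight
```

Both assignments use four clusters, but the skewed one has less than half the entropy,
which is the effect an entropy constraint tries to exploit.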

Thus, ECQ assigns weight values not only based on their distances to the 
centroids, but also based on the information content of the clusters. Similar to 
other rate-distortion-optimization methods, ECQ applies Lagrange optimization: 


$$A^{(l)} = \underset{c}{\operatorname{argmin}} \; d\big(W^{(l)}, w_c^{(l)}\big) - \lambda^{(l)} \log_2\big(P_c^{(l)}\big). \qquad (1)$$

Per network layer $l$, the assignment matrix $A^{(l)}$ maps a centroid to each
weight based on a minimization problem consisting of two terms: Given the
full-precision weight matrix $W^{(l)}$ and the centroid values $w_c^{(l)}$, the first term
in Eq. (1) measures the squared distance between all weight elements and the
centroids, indexed by $c$. The second term in Eq. (1) is weighted by the scalar
Lagrange parameter $\lambda^{(l)}$ and describes the entropy constraint. More precisely,
the information content $I$ is considered, i.e., $I = -\log_2(P_c^{(l)})$, where the
probability $P_c^{(l)} \in (0, 1]$ defines how likely a weight element $w_{ij}^{(l)} \in W^{(l)}$ is going to be
assigned to centroid $w_c^{(l)}$. Data elements with a high occurrence frequency, or a
high probability, contain a low information content, and vice versa. $P_c^{(l)}$ is calculated
layer-wise as $P_c^{(l)} = N_{w_c}^{(l)} / N_W^{(l)}$, with $N_{w_c}^{(l)}$ being the number of full-precision
weight elements assigned to the cluster with centroid value $w_c^{(l)}$ (based on the
squared distance), and $N_W^{(l)}$ being the total number of parameters in $W^{(l)}$. Note
that $\lambda^{(l)}$ is scaled with a factor based on the number of parameters a layer has in
proportion to other layers in the network to mitigate the constraint for smaller 
layers. 
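
The assignment of Eq. (1) can be sketched for a single layer as follows. This is an
illustrative plain-Python version, not the authors' implementation: the weights,
centroids and the value of the Lagrange parameter are hypothetical, and the layer-size
dependent scaling of the Lagrange parameter mentioned above is omitted.

```python
import math

def ecq_assign(weights, centroids, lam):
    """Assign each weight to the centroid minimizing squared distance + lam * information."""
    clusters = range(len(centroids))
    # P_c is estimated from a nearest-neighbor (squared distance) clustering.
    nearest = [min(clusters, key=lambda c: (w - centroids[c]) ** 2) for w in weights]
    probs = [nearest.count(c) / len(weights) for c in clusters]

    def cost(w, c):
        if probs[c] == 0.0:
            return math.inf            # empty cluster: infinite information content
        return (w - centroids[c]) ** 2 - lam * math.log2(probs[c])

    return [min(clusters, key=lambda c: cost(w, c)) for w in weights]

centroids = [-0.5, 0.0, 0.5]
weights = [-0.52, -0.3, -0.1, -0.05, 0.0, 0.03, 0.07, 0.1, 0.28, 0.6]

nearest_neighbor = ecq_assign(weights, centroids, lam=0.0)   # pure distance
entropy_constrained = ecq_assign(weights, centroids, lam=0.05)
```

With the entropy term switched on, the weights −0.3 and 0.28 move from their nearest
centroids to the more populated zero cluster, which increases sparsity and lowers the
entropy of the assignment.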

The entropy regularization term motivates sparsity and low-bit weight quan- 
tization in order to achieve smaller coded neural network representations. Based 
on the specific neural network coding optimization, we developed ECQ. This 
algorithm is based on previous work in Entropy-Constrained Trained Ternar- 
ization (EC2T) [28]. EC2T trains sparse and ternary DNNs to state-of-the-art 
accuracies. 

In our developed ECQ, we generalize the EC2T method, such that DNNs of 
variable bit width can be rendered. Also, ECQ does not train centroid values 
to facilitate integer arithmetic on general hardware. The proposed quantization- 
aware training algorithm includes the following steps: 


1. Quantize weight parameters by applying ECQ (but keep a copy of the full- 
precision weights). 
2. Apply Straight-Through Estimator (STE) [5]: 
(a) Compute forward and backward pass through quantized model version. 
(b) Update full-precision weights with scaled gradients obtained from quan- 
tized model. 
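
The two steps above can be sketched on a toy problem. The example below is a strong
simplification and not the authors' training code: it uses a single linear neuron with a
squared-error loss, replaces ECQ by plain nearest-centroid quantization, and applies
unscaled gradients with a fixed learning rate. All numbers are hypothetical.

```python
def quantize(weights_fp, centroids):
    """Step 1 (simplified): map every full-precision weight to its nearest centroid."""
    return [min(centroids, key=lambda c: (w - c) ** 2) for w in weights_fp]

def ste_step(weights_fp, centroids, x, target, lr):
    """Step 2: forward/backward through the quantized model, update the FP copy."""
    weights_q = quantize(weights_fp, centroids)
    y = sum(wq * xi for wq, xi in zip(weights_q, x))       # forward pass (quantized)
    grad_y = 2.0 * (y - target)                            # dL/dy for L = (y - t)^2
    grads_q = [grad_y * xi for xi in x]                    # dL/dw_q
    # Straight-through: the quantizer is treated as identity in the backward pass,
    # so the gradients of the quantized weights are applied to the FP weights.
    new_fp = [w - lr * g for w, g in zip(weights_fp, grads_q)]
    return new_fp, (y - target) ** 2

centroids = [-0.5, 0.0, 0.5]
weights_fp = [0.2, -0.1, 0.4]          # full-precision copy kept during training
x, target = [1.0, 2.0, -1.0], 1.5

losses = []
for _ in range(20):
    weights_fp, loss = ste_step(weights_fp, centroids, x, target, lr=0.02)
    losses.append(loss)

final_quantized = quantize(weights_fp, centroids)
```

The full-precision copy accumulates small gradient steps until a weight crosses a
decision boundary between two centroids, at which point its quantized value changes.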
4 Explainability-Driven Quantization


Explainable AI techniques can be applied to find relevant features in input as 
well as latent space. Covering large sets of data, identification of relevant and 
functional model substructures is thus possible. Assuming over-parameterization 
of DNNs, the authors of [51] exploit this for pruning (of irrelevant filters) to great 
effect. Their successful implementation shows the potential of applying XAI for 
the purpose of quantization as well, as sparsification is part of quantization, 
e.g., by assigning weights to the zero-cluster. Here, XAI opens up the possibility 
to go beyond regarding model weights as static quantities and to consider the 
interaction of the model with given (reference) data. This work aims to combine 
the two orthogonal approaches of ECQ and XAI in order to further improve 
sparsity and efficiency of DNNs. In the following, the LRP method is introduced, 
which can be applied to extract relevances of individual neurons, as well as 
weights. 


4.1 Layer-Wise Relevance Propagation 


Layer-wise Relevance Propagation (LRP) [3] is an attribution method based on 
the conservation of flows and proportional decomposition. It is explicitly aligned
to the layered structure of machine learning models. Regarding a model with n 
layers 


$$f(x) = f_n \circ \dots \circ f_1(x), \qquad (2)$$


LRP first calculates all activations during the forward pass starting with $f_1$
until the output layer $f_n$ is reached. Thereafter, the prediction score $f(x)$ of any
chosen model output is redistributed layer-wise as an initial quantity of relevance
$R_n$ back towards the input. During this backward pass, the redistribution
process follows a conservation principle analogous to Kirchhoff’s laws in electri- 
cal circuits. Specifically, all relevance that flows into a neuron is redistributed 
towards neurons of the layer below. In the context of neural network predictors, 
the whole LRP procedure can be efficiently implemented as a forward-backward 
pass with modified gradient computation, as demonstrated in, e.g., [35]. 

Considering a layer’s output neuron $j$, the distribution of its assigned relevance
score $R_j$ towards its lower layer input neurons $i$ can be, in general, achieved
by applying the basic decomposition rule

$$R_{i \leftarrow j} = \frac{z_{ij}}{z_j} R_j, \qquad (3)$$

where $z_{ij}$ describes the contribution of neuron $i$ to the activation of neuron
$j$ [3,29] and $z_j$ is the aggregation of the pre-activations $z_{ij}$ at output neuron $j$,
i.e., $z_j = \sum_i z_{ij}$. Here, the denominator enforces the conservation principle over
all $i$ contributing to $j$, meaning $\sum_i R_{i \leftarrow j} = R_j$. This is achieved by ensuring the
decomposition of $R_j$ is in proportion to the relative flow of activations $z_{ij}/z_j$ in
the forward pass. The relevance of a neuron $i$ is then simply an aggregation of
all incoming relevance quantities

$$R_i = \sum_j R_{i \leftarrow j}. \qquad (4)$$

Fig. 3. LRP can be utilized to calculate relevance scores for weight parameters $W$,
which contribute to the activation of output neurons $z_j$ during the forward pass in
interaction with data-dependent inputs $a_i$. In the backward pass, relevance messages
$R_{i \leftarrow j}$ can be aggregated at neurons/input activations $a_i$, but also at weights $W$.


Given the conservation of relevance in the decomposition step of Eq. (3), this
means that $\sum_i R_i = \sum_j R_j$ holds for consecutive neural network layers. Next
to component-wise non-linearities, linearly transforming layers (e.g., dense or
convolutional) are by far the most common and basic building blocks of neural
networks such as VGG-16 [41] or ResNet [18]. While LRP treats the former via
identity backward passes, relevance decomposition formulas can be given for the
latter explicitly in terms of weights $w_{ij}$ and input activations $a_i$. Let the output
of a linear neuron be given as $z_j = \sum_{i,0} z_{ij} = \sum_{i,0} a_i w_{ij}$ with bias “weight”
$w_{0j}$ and respective activation $a_0 = 1$. In accordance with Eq. (3), relevance is then
propagated as


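$$R_{i \leftarrow j} = \overbrace{\underbrace{a_i w_{ij}}_{z_{ij}} \frac{R_j}{z_j}}^{\text{explicit}} = \overbrace{a_i \underbrace{w_{ij}}_{\partial z_j / \partial a_i} \frac{R_j}{z_j}}^{\text{mod. grad.}} = \overbrace{w_{ij} \underbrace{a_i}_{\partial z_j / \partial w_{ij}} \frac{R_j}{z_j}}^{\text{mod. grad.}}. \qquad (5)$$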


Equation (5) exemplifies that the explicit computation of the backward directed
relevances $R_{i \leftarrow j}$ in linear layers can be replaced equivalently by a (modified)
“gradient × input” approach. Therefore, the activation $a_i$ or weight $w_{ij}$ can act
as the input and target with respect to which the partial derivative regarding output $z_j$
is computed. The scaled relevance term $R_j/z_j$ takes the role of the upstream
gradient to be propagated. 
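
The equivalence stated in Eq. (5) can be checked numerically on a small dense layer.
The sketch below is illustrative only: it assumes a layer without bias, and the
activations, weights and output relevances are hypothetical values.

```python
def lrp_dense(a, W, R_out):
    """Basic LRP rule for z_j = sum_i a_i * W[i][j] (no bias, for simplicity)."""
    n_in, n_out = len(a), len(R_out)
    z = [sum(a[i] * W[i][j] for i in range(n_in)) for j in range(n_out)]
    # Explicit form: messages R_{i<-j} = z_ij / z_j * R_j, see Eq. (3).
    messages = [[a[i] * W[i][j] / z[j] * R_out[j] for j in range(n_out)]
                for i in range(n_in)]
    R_in = [sum(row) for row in messages]                  # aggregation, Eq. (4)
    # Modified gradient form: dz_j/da_i = w_ij, upstream "gradient" is R_j / z_j.
    upstream = [R_out[j] / z[j] for j in range(n_out)]
    R_in_grad = [a[i] * sum(W[i][j] * upstream[j] for j in range(n_out))
                 for i in range(n_in)]
    return messages, R_in, R_in_grad

a = [1.0, 2.0, 0.5]                              # input activations a_i
W = [[0.5, -1.0], [0.25, 0.5], [1.0, 2.0]]       # W[i][j], 3 inputs, 2 outputs
R_out = [0.6, 0.4]                               # relevance arriving at the outputs

messages, R_in, R_in_grad = lrp_dense(a, W, R_out)
```

Both computations give the same input relevances, and their sum equals the sum of the
output relevances, as required by the conservation principle.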

At this point, LRP offers the possibility to calculate relevances not only of 
neurons, but also of individual weights, depending on the aggregation strategy, 
as illustrated in Fig. 3. This can be achieved by aggregating relevances at the
corresponding (gradient) targets, i.e., plugging Eq. (5) into Eq. (4). For a dense
layer, this yields

$$R_{w_{ij}} = R_{i \leftarrow j} \qquad (6)$$


with an individual weight as the aggregation target contributing (exactly) once
to an output. A weight of a convolutional filter, however, is applied multiple
times within a neural network layer. Here, we introduce a variable $k$ signifying
one such application context, e.g., one specific step in the application of a filter $w$
in a (strided) convolution, mapping the filter’s inputs $i$ to an output $j$. While the
relevance decomposition formula within one such context $k$ does not change from
Eq. (3), we can uniquely identify its backwards distributed relevance messages
as $R^k_{i \leftarrow j}$. With that, the aggregation of relevance at the convolutional filter $w$ at
a given layer is given with

$$R_{w_{ij}} = \sum_k R^k_{i \leftarrow j}, \qquad (7)$$

where k iterates over all applications of this filter weight. 
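
For illustration, the aggregation over application contexts can be written out for a
one-dimensional filter. The sketch assumes a valid cross-correlation with stride 1 and
no bias; the signal, kernel and output relevances are hypothetical.

```python
def conv1d_weight_relevance(signal, kernel, R_out):
    """Sum the basic-rule messages of every filter weight over all positions k."""
    size = len(kernel)
    R_w = [0.0] * size
    for k in range(len(signal) - size + 1):        # one application context per k
        z = sum(signal[k + i] * kernel[i] for i in range(size))
        for i in range(size):
            # Message within context k, as in Eq. (3), aggregated at weight i.
            R_w[i] += signal[k + i] * kernel[i] / z * R_out[k]
    return R_w

signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 1.0]
R_out = [2.5, 4.0, 5.5]      # here: the three filter outputs themselves

R_w = conv1d_weight_relevance(signal, kernel, R_out)
```

Each of the two filter weights is used three times, and its relevance is the sum of the
three corresponding messages. The total relevance of the filter equals the total output
relevance.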

Note that in modern deep learning frameworks, derivatives wrt. activations or 
weights can be computed efficiently by leveraging the available automatic differ- 
entiation functionality (autograd) [33]. Specifying the gradient target, autograd 
then already merges the relevance decomposition and aggregation steps outlined 
above. Thus, computation of relevance scores for filter weights in convolutional 
layers is also appropriately supported, for Eq. (3), as well as any other relevance 
decomposition rule which can be formulated as a modified gradient backward 
pass, such as Eqs. (8) and (9). The ability to compute the relevance of individual 
weights is a critical ingredient for the eXplainability-driven Entropy-Constrained 
Quantization strategy introduced in Sect. 4.2. 

In the following, we will briefly introduce further LRP decomposition rules 
used throughout our study. In order to increase numerical stability of the basic 
decomposition rule in Eq. (3), the LRP $\varepsilon$-rule introduces a small term $\varepsilon$ in the
denominator:

$$R_{i \leftarrow j} = \frac{z_{ij}}{z_j + \varepsilon \cdot \operatorname{sign}(z_j)} R_j. \qquad (8)$$

The term $\varepsilon$ absorbs relevance for weak or contradictory contributions to the
activation of neuron $j$. Note here, in order to avoid divisions by zero, the sign($z$)
function is defined to return 1 if $z > 0$ and −1 otherwise. In the case of a deep
rectifier network, it can be shown [1] that the application of this rule to the whole
neural network results in an explanation that is similar to (simple) “gradient ×
input” [40]. A common problem within deep neural networks is that the gradient
becomes increasingly noisy with network depth [35], partly as a result of gradient
shattering [4]. The $\varepsilon$ parameter is able to suppress the influence of that noise
given sufficient magnitude. With the aim of achieving robust decompositions,
several purposed rules next to Eqs. (3) and (8) have been proposed in literature
(see [29] for an overview). 
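
The absorbing effect of the stabilizer in Eq. (8) can be seen in a minimal example with
two contradictory contributions. The numbers are hypothetical, and the sign convention
follows the definition given above.

```python
import math

def sign(z):
    """Sign as defined in the text: never returns 0, so the denominator cannot vanish."""
    return 1.0 if z > 0 else -1.0

def lrp_epsilon(z_contrib, R_j, eps):
    """Relevance messages of one output neuron j under the epsilon-rule, Eq. (8)."""
    z_j = sum(z_contrib)
    return [z_ij / (z_j + eps * sign(z_j)) * R_j for z_ij in z_contrib]

contributions = [2.0, -1.5]      # contradictory inputs, weak net activation z_j = 0.5

basic = lrp_epsilon(contributions, R_j=1.0, eps=0.0)        # reduces to Eq. (3)
stabilized = lrp_epsilon(contributions, R_j=1.0, eps=0.5)

# A vanishing pre-activation no longer causes a division by zero.
at_zero = lrp_epsilon([1.0, -1.0], R_j=1.0, eps=0.1)
```

Without the stabilizer, the small denominator inflates the messages to 4 and −3. With
the stabilizer, they shrink to 2 and −1.5, and half of the relevance is absorbed, so the
rule is no longer strictly conservative.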

One particular rule choice, which reduces the problem of gradient shattering 
and which has been shown to work well in practice, is the $\alpha\beta$-rule [3,30]

$$R_{i \leftarrow j} = \left( \alpha \frac{(z_{ij})^+}{(z_j)^+} - \beta \frac{(z_{ij})^-}{(z_j)^-} \right) R_j, \qquad (9)$$

where $(\cdot)^+$ and $(\cdot)^-$ denote the positive and negative parts of the variables $z_{ij}$
and $z_j$, respectively. Further, the parameters $\alpha$ and $\beta$ are chosen subject to the
constraints $\alpha - \beta = 1$ and $\beta \geq 0$ (i.e., $\alpha \geq 1$) in order to propagate relevance
conservatively throughout the network. Setting $\alpha = 1$, the relevance flow is
computed only with respect to the positive contributions $(\cdot)^+$ in the forward
pass. When alternatively parameterizing with, e.g., $\alpha = 2$ and $\beta = 1$, which is a
common choice in literature, negative contributions are included as well, while 
favoring positive contributions. 
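
A minimal sketch of Eq. (9) for a single output neuron is given below. It assumes that
the positive and negative parts of $z_j$ are the sums of the positive and negative
contributions, respectively, and uses hypothetical contribution values.

```python
def lrp_alpha_beta(z_contrib, R_j, alpha, beta):
    """Relevance messages of one output neuron j under the alpha-beta rule, Eq. (9)."""
    z_pos = sum(z for z in z_contrib if z > 0)     # (z_j)^+
    z_neg = sum(z for z in z_contrib if z < 0)     # (z_j)^-
    messages = []
    for z_ij in z_contrib:
        pos = max(z_ij, 0.0) / z_pos if z_pos > 0 else 0.0
        neg = min(z_ij, 0.0) / z_neg if z_neg < 0 else 0.0
        messages.append((alpha * pos - beta * neg) * R_j)
    return messages

contributions = [3.0, 1.0, -2.0]

positive_only = lrp_alpha_beta(contributions, R_j=1.0, alpha=1.0, beta=0.0)
with_negative = lrp_alpha_beta(contributions, R_j=1.0, alpha=2.0, beta=1.0)
```

With $\beta = 0$, the negatively contributing input receives no relevance at all. With
$\alpha = 2$ and $\beta = 1$, it receives a negative share, while the total relevance is still
conserved in this example because $\alpha - \beta = 1$.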

Recent works recommend a composite strategy of decomposition rule assign- 
ments mapping multiple rules purposedly to different parts of the network [25, 
29]. This leads to an increased quality of relevance attributions for the inten- 
tion of explaining prediction outcomes. In the following, a composite strategy 
consisting of the $\varepsilon$-rule for dense layers and the $\alpha\beta$-rule with $\beta = 1$ for convolutional
layers is used. Regarding LRP-based pruning, Yeom et al. [51] utilize the
$\alpha\beta$-rule (9) with $\beta = 0$ for convolutional as well as dense layers. However, using
$\beta = 0$, subparts of the network that contributed solely negatively might receive
no relevance. In our case of quantization, all individual weights have to be considered.
Thus, the $\alpha\beta$-rule with $\beta = 1$ is used for convolutional layers, because
it also includes negative contributions in the relevance distribution process and 
reduces gradient shattering. The LRP implementation is based on the software 
package Zennit [2], which offers a flexible integration of composite strategies and 
readily enables extensions required for the computation of relevance scores for 
weights. 


4.2 eXplainability-Driven Entropy-Constrained Quantization 


For our novel eXplainability-driven Entropy-Constrained Quantization (ECQ*), 
we modify the ECQ assignment function to optimally re-assign the weight clus- 
tering based on LRP relevances in order to achieve higher performance measures 
and compression efficiency. The rationale behind using LRP to optimize the ECQ 
quantization algorithm is two-fold: 


Assignment Correction: In the quantization process, the entropy regularization 
term encourages weight assignments to more populated clusters in order to min- 
imize the overall entropy. Since weights are usually normally distributed around 
zero, the entropy term also strongly encourages sparsity. In practice, this quan- 
tization scheme works well rendering sparse and low-bit neural networks for 
various machine learning tasks and network architectures [28, 48, 50]. 

From a scientific point of view, however, one might wonder why the shift of 
numerous weights from their nearest-neighbor clusters to a more distant cluster 
does not lead to greater model degradation, especially when assigned to zero. 
The quantization-aware re-training and fine-tuning can, up to a certain extent, 
compensate for this shift. Here, the LRP-generated relevances show potential 
to further improve quantization in two ways: 1) by re-adding “highly relevant” 
weights (i.e., preventing their assignment to zero if they have a high relevance), 
and 2) by assigning additional, “irrelevant” weights to zero (i.e., preventing their 
distance- and entropy-based assignment to a non-zero centroid). 
Fig. 4. Weight relevance $R_{w_{ij}}$ vs. weight value $w_{ij}$ for the input layer (left) and output
layer (right) of the full-precision MLP_GSC model (introduced in Sect. 5.1). The black
histograms to the top and right of each panel display the distributions of weights (top)
and relevances (right). The blue histograms further show the amount of relevance (blue)
of each weight histogram bin. All relevances are collected over the validation set with
equally weighted samples (i.e., by choosing $R_n = 1$). The value $c$ measures the Pearson
correlation coefficient between weights and relevances.


We evaluated the discrepancy between weight relevance and magnitude in 
a correlation analysis depicted in Fig. 4. Here, all weight values $w_{ij}$ are plotted
against their associated relevance $R_{w_{ij}}$ for the input layer (left) and output
layer (right) of the full-precision model MLP_GSC (which will be introduced in
Sect. 5.1). In addition, histograms of both parameters are shown above and to the
right of each relevance-weight-chart in Fig. 4 to better visualize the correlation
between $w_{ij}$ and $R_{w_{ij}}$. In particular, a weight of high magnitude is not necessarily
also a relevant weight. And in contrast, there are also weights of small or medium 
magnitude that have a high relevance and thus should not be omitted in the 
quantization process. This phenomenon is especially true for layers closer to the 
input. The outcome of this analysis strongly motivates the use of LRP relevances 
for the weight assignment correction process of low-bit and sparse ECQ*. 


Regularizing Effect for Training: Since the previously described re-adding (which 
is also referred to as “regrowth” in literature) and removing of weights due to 
LRP depends on the propagated input data, weight relevances can change from 
data batch to data batch. In our quantization-aware training, we apply the STE, 
and thus the re-assignment of weights, after each forward-backward pass. 

The regularizing effect which occurs due to dynamic re-adding and remov- 
ing weights is probably related to the generalization effect which random 
Dropout [42] has on neural networks. However, as elaborated in the extensive 
survey by Hoefler et al. [21], in terms of dynamic sparsification, re-adding (“drop
in”) the best weights is as crucial as removing (“drop out”) the right ones. Instead
of randomly dropping weights, the work in [9] shows that re-adding weights based
on largest gradients is related to Hebbian learning and biologically more plausible.
LRP relevances go beyond the gradient criterion, which is why we consider
it a suitable candidate. 

In order to embed LRP relevances in the assignment function (1), we update 
the cost for the zero centroid (c = 0) by extending it as 


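$$\rho\, R_{W^{(l)}} \cdot \Big( d\big(W^{(l)}, w_{c=0}^{(l)}\big) - \lambda^{(l)} \log_2\big(P_{c=0}^{(l)}\big) \Big) \qquad (10)$$


with relevance matrix $R_{W^{(l)}}$ containing all weight relevances $R_{w_{ij}}$ of layer $l$
with row/input index $i$ and column/output index $j$, as specified in Eq. (7). The
relevance-dependent assignment matrix $A_x^{(l)}$ is thus described by:


$$A_x^{(l)}\big(W^{(l)}\big) = \underset{c}{\operatorname{argmin}} \begin{cases} \rho\, R_{W^{(l)}} \cdot \Big( d\big(W^{(l)}, w_{c=0}^{(l)}\big) - \lambda^{(l)} \log_2\big(P_{c=0}^{(l)}\big) \Big), & \text{if } c = 0 \\ d\big(W^{(l)}, w_c^{(l)}\big) - \lambda^{(l)} \log_2\big(P_c^{(l)}\big), & \text{if } c \neq 0 \end{cases} \qquad (11)$$


where $\rho$ is a normalizing scaling factor, which also takes relevances of the previous
data batches into account (momentum). The term $\rho\, R_{W^{(l)}}$ increases the assignment
cost of the zero cluster for relevant weights and decreases it for irrelevant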
weights. 
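
The effect of the relevance-scaled zero-cluster cost in Eq. (11) can be sketched for a
handful of weights. This is an illustrative plain-Python version with hypothetical
weights, relevances, cluster probabilities and parameter values. The momentum-based
computation of the scaling factor is not modeled, and the factor is simply passed in.

```python
import math

def ecqx_assign(weights, relevances, centroids, probs, lam, rho):
    """Relevance-dependent assignment; centroids[0] is assumed to be the zero centroid."""
    assignment = []
    for w, r in zip(weights, relevances):
        costs = []
        for c, (w_c, p_c) in enumerate(zip(centroids, probs)):
            cost = (w - w_c) ** 2 - lam * math.log2(p_c)   # ECQ cost, Eq. (1)
            if c == 0:
                cost *= rho * r                            # zero-cluster cost, Eq. (10)
            costs.append(cost)
        assignment.append(costs.index(min(costs)))
    return assignment

centroids = [0.0, -0.5, 0.5]
probs = [0.6, 0.2, 0.2]              # hypothetical cluster probabilities P_c
weights = [0.28, 0.28, 0.45]

# rho * R = 1 for every weight reproduces the plain ECQ assignment.
plain = ecqx_assign(weights, [0.5, 0.5, 0.5], centroids, probs, lam=0.05, rho=2.0)
# A relevant weight (R = 0.9) and two irrelevant ones (R = 0.1 and R = 0.05).
driven = ecqx_assign(weights, [0.9, 0.1, 0.05], centroids, probs, lam=0.05, rho=2.0)
```

Compared to the plain assignment, the relevant first weight is re-added to a non-zero
cluster, whereas the irrelevant third weight is moved to the zero cluster although its
nearest centroid is 0.5.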

Figure 5 shows an example of one ECQ* iteration that includes the following 
steps: 1) ECQ* computes a forward-backward pass through the quantized model, 
deriving its weight gradients. LRP relevances $R_W$ are computed by redistributing
modified gradients according to Eq. (7). 2) LRP relevances are then scaled by
a normalizing scaling factor $\rho$, and 3) weight gradients are scaled by multiplying
them with the non-zero centroid values (e.g., the upper left gradient of −0.03 is multiplied
by the centroid value 1.36). 4) The scaled gradients are then applied to the
full-precision (FP) background model, which is a copy of the initial unquantized
neural network and is used only for weight assignment, i.e., it is updated with the
scaled gradients of the quantized network but does not perform inference itself.
5) The FP model is updated using the ADAM optimizer [23]. Then, weights are
assigned to their nearest-neighbor cluster centroids. 6) Finally, the assignment
cost for each weight to each centroid is calculated using the $\lambda$-scaled information
content of clusters (i.e., $I_-$ (blue) $\approx 1.7$, $I_0$ (green) $= 1.0$ and $I_+$ (purple) $\approx 2.4$
in this example) and $\rho$-scaled relevances. Here, relevances above the exemplary
threshold (i.e., mean $R_W \approx 0.3$) increase the cost for the zero cluster assignment,
while relevances below (highlighted in red) decrease it. Each weight is assigned
such that the cost function is minimized according to Eq. (11). 7) Depending on
the intensity of the entropy and relevance constraints (controlled by $\lambda$ and $p$),
different assignment candidates can be rendered to fit a specific deep learning
task. In the example shown in Fig. 5, an exemplary candidate grid was selected,
which is depicted at the top left of the figure. The weight at grid coordinate
D2, for example, was assigned to the zero cluster due to its irrelevance and the 
weight at C3 due to the entropy constraint. 
Fig. 5. Exemplary ECQ* weight update. For simplicity, 3 centroids are used (i.e.,
symmetric 2 bit case). The process involves the following steps: 1) Derive gradients and
LRP relevances from forward-backward pass. 2) LRP relevance scaling. 3) Gradients
scaling. 4) Gradient attachment to full precision background model. 5) Background
model update and nearest-neighbor clustering. 6) Computing of the assignment cost
for each weight using the $\lambda$-scaled information content of clusters and the $\rho$-scaled
relevances. Assign each weight by minimizing the cost. 7) Choosing an appropriate
candidate (of various $\lambda$ and $p$ settings).


In the case of dense or convolutional layers, LRP relevances can be computed 
efficiently using the autograd functionality, as mentioned in Sect. 4.1. For a clas- 
sification task, it is sensible to use the target class score as a starting point for 
the LRP backward pass. This way, the relevance of a neuron or weight describes 
its contribution to the target class prediction. Since the output is propagated 
throughout the network, all relevance is proportional to the output score. Con- 
sequently, relevances of each sample in a training batch are, in general, weighted 
differently according to their respective model output, or prediction confidence. 
However, with the aim of suppressing relevances for inaccurate predictions, it is 
sensible to weigh samples according to the model output, because a low output 
score usually corresponds to an unconfident decision of the model. 

After the relevance calculation of a whole data batch, the relevance scores
$R_W$ are transformed to their absolute value and normalized, such that $R_{W^{(l)}} \in [0, 1]$.
Even though negative contributions work against an output, they might
still be relevant to the network functionality, and their influence is thus consid- 
ered instead of omitted. On one hand, they can lead to positive contributions for 
other classes. On the other, they can be relevant to balancing neuron activations 
throughout the network. 

The relevance matrices $R_{W^{(l)}}$ resulting from LRP are usually sparse, as can
be seen in the weight histograms of Fig. 4. In order to control the effect of LRP
in the assignment function, the relevances are exponentially transformed by $\beta$,
applying a similar effect as for gamma correction in image processing:

$$R_{W^{(l)}} = \big(R_{W^{(l)}}\big)^{\beta}$$

with $\beta \in [0, 1]$. Here, the parameter $\beta$ is initially chosen such that the mean
relevance $\bar{R}_{W^{(l)}}$ does not change the assignment, e.g., $\rho\, (\bar{R}_{W^{(l)}})^{\beta} = 1$ or
$\beta = -\frac{\ln \rho}{\ln \bar{R}_{W^{(l)}}}$. In order to further control the sparsity of a layer, the target
sparsity $p$ is introduced. If the assignment increases a layer’s sparsity by more
than the target sparsity $p$, parameter $\beta$ is accordingly minimized. Thus, in ECQ*,
LRP relevances are directly included in the assignment function and their effect 
can be controlled by parameter p. An experimental validation of the developed 
ECQ* method, including state-of-the-art comparison and parameter variation 
tests, is given in the following section. 
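
The relevance pre-processing described in this subsection can be sketched as follows.
Several details are assumptions made only for this illustration: relevances are
normalized by dividing by their maximum magnitude, the scaling factor is given as a
fixed value, and the sparsity-dependent reduction of the exponent is not modeled.

```python
import math

def preprocess_relevances(raw, rho):
    """Absolute value, normalization to [0, 1], and exponential transform by beta."""
    magnitudes = [abs(r) for r in raw]
    peak = max(magnitudes)
    normalized = [m / peak for m in magnitudes]      # assumed max-normalization
    mean_rel = sum(normalized) / len(normalized)
    # Initial beta such that rho * mean_rel ** beta == 1 (mean_rel must be below 1).
    beta = -math.log(rho) / math.log(mean_rel)
    beta = min(max(beta, 0.0), 1.0)                  # keep beta within [0, 1]
    transformed = [r ** beta for r in normalized]
    return transformed, beta, mean_rel

raw = [-0.8, 0.02, 0.1, 0.0, 0.4]    # hypothetical signed weight relevances
transformed, beta, mean_rel = preprocess_relevances(raw, rho=1.5)
```

For these values, the exponent lies strictly between 0 and 1, so small relevances are
raised relative to large ones, similar to a gamma correction, while a relevance equal to
the mean leaves the zero-cluster cost unchanged.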


5 Experiments 


In the experiments, we evaluate our novel quantization method ECQ* using 
two widely used neural network architectures, namely a convolutional neural 
network (CNN) and a multilayer perceptron (MLP). More precisely, we deploy 
VGG16 for the task of small-scale image classification (CIFAR-10), ResNet18 
for the Pascal Visual Object Classes Challenge (Pascal VOC) and an MLP with 
5 hidden layers and ReLU non-linearities solving the task of keyword spotting 
in audio data (Google Speech Commands). 

In the first subsection, the experimental setup and test conditions are 
described, while the results are shown and discussed in the second subsection. 
In particular, results for ECQ* hyperparameter variation are shown, followed by 
a comparison against classical ECQ and results for bit width variation. Finally, 
overall results for ECQ* for different accuracy and compression measurements 
are shown and discussed. 


5.1 Experimental Setup 


All experiments were conducted using the PyTorch deep learning framework, ver- 
sion 1.7.1 with torchvision 0.8.2 and torchaudio 0.7.2 extensions. As a hardware 
platform we used Tesla V100 GPUs with CUDA version 10.2. The quantization- 
aware training of ECQ* was executed for 20 epochs in all experiments. As an 
optimizer we used ADAM with an initial learning rate of 0.0001. In the scope of 
the training procedure, we consider all convolutional and fully-connected layers 
of the neural networks for quantization, including the input and output layers. 
Note that numerous approaches in related works keep the input and/or output 
layers in full-precision (32 bit float), which may compensate for the model degra- 
dation caused by quantization, but is usually difficult to bring into application 
and incurs significant overhead in terms of energy consumption. 
Google Speech Commands. The Google Speech Commands (GSC [46])
dataset consists of 105,829 utterances of 35 words recorded from 2,618 speakers.
The standard is to discriminate ten words “Yes”, “No”, “Up”, “Down”, “Left”,
“Right”, “On”, “Off”, “Stop”, and “Go”, and to add two additional labels, one
for “Unknown Words”, and another for “Silence” (no speech detected). Following
the official TensorFlow example code for training³, we implemented the corresponding
data augmentation with PyTorch’s torchaudio package. It includes
randomly adding background noise with a probability of 80% and time shifting
the audio by [−100, 100] ms with a probability of 50%. To generate features,
the audio is transformed to MFCC fingerprints (Mel Frequency Cepstral Coef- 
ficients). We use 15 bins and a window length of 2000 ms. To solve GSC, we 
deploy an MLP (which we name MLP_GSC in the following) consisting of an 
input layer, five hidden layers and an output layer featuring 512, 512, 256, 256, 
128, 128 and 12 output features, respectively. The MLP_GSC was pre-trained for 
100 epochs using stochastic gradient descent (SGD) optimization with a momen- 
tum of 0.9, an initial learning rate of 0.01 and a cosine annealing learning rate 
schedule. 


CIFAR-10. The CIFAR-10 [26] dataset consists of natural images with a resolution
of 32 × 32 pixels. It contains 10 classes, with 6,000 images per class.
Data is split into 50,000 training and 10,000 test images. We use standard data
pre-processing, i.e., normalization, random horizontal flipping and cropping.
To solve the task, we deploy a VGG16 from the torchvision model zoo⁴. The
VGG16 classifier is adapted from 1,000 ImageNet classes to ten CIFAR classes
by replacing its three fully-connected layers (with dimensions [25,088, 4,096],
[4,096, 4,096], [4,096, 1,000]) by two ([512, 512], [512, 10]), as a consequence of
CIFAR’s smaller image size. We also implemented a VGG16 supporting batch 
normalization (“BatchNorm” in the following), i.e., VGG16_bn from torchvision. 
The VGGs were transfer-learned for 60 epochs using ADAM optimization and 
an initial learning rate of 0.0005. 


Pascal VOC. The Pascal Visual Object Classes Challenge 2012 (VOC2012) [13] 
provides 11,540 images associated with 20 classes. The dataset has been split 
into 80% for training/validation and 20% for testing. We applied normalization, 
random horizontal flipping and center cropping to 224 × 224 pixels. As a neural
network architecture, the pre-trained ResNet18 from the torchvision model zoo 
was deployed. Its classifier was adapted to predict 20 instead of 1,000 classes and 
the model was transfer-learned for 30 epochs using ADAM optimization with an 
initial learning rate of 0.0001. 


3 https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/speech_commands.
4 https://pytorch.org/vision/stable/models.html.

Fig. 6. Hyperparameter p controls the LRP-introduced sparsity.


5.2 ECQ* Results 


In this subsection, we compare ECQ* to state-of-the-art ECQ quantization, 
analysing accuracy preservation vs. sparsity increase. Furthermore, we inves- 
tigate ECQ* compressibility, behavior on BatchNorm layers, and an appropriate 
choice of hyperparameters. 


ECQ* Hyperparameter Variation. In ECQ*, two important hyperparameters,
$\lambda$ and $p$, influence the performance and thus are optimized for the comparative
experiments described below. The parameter $\lambda$ increases the intensity
of the entropy constraint and thus distributes the working points of each trial
over a range of sparsities (see Fig. 6). The $p$ hyperparameter defines an upper
bound for the per-layer percentage of zero values, allowing a maximum amount
of $p$ additional sparsity, on top of the $\lambda$-introduced sparsity. It thus implicitly
controls the intensity of the LRP constraint. 

Figure 6 shows results using several $p$ values for the 4 bit (bw = 4) quantization
of the MLP_GSC model. Note that the variation of bit width bw is
discussed below the comparative results. For smaller $p$, less sparse models are
rendered with higher top-1 accuracies in the low-sparsity regime (e.g., $p = 0.02$
or $p = 0.05$ between 30-50% total network sparsity). In the regime of higher
sparsity, larger values of $p$ show a better sparsity-accuracy trade-off. Note that
larger $p$ do not only set more weights to zero but also re-add relevant weights
(regrowth). For $p = 0.4$ and $p = 0.5$, both lines are congruent since no layer is
achieving more than 40% additional LRP-introduced sparsity with the initial $\beta$
value (cf. Sect. 4.2). 


ECQ* vs. ECQ Analysis. As shown in Fig. 7, the LRP-driven ECQ* approach 
renders models with higher performance and simultaneously higher efficiency. 
In this comparison, efficiency is determined in terms of sparsity, which can be 
exploited to compress the model more or to skip arithmetic operations with 
zero values. Both methods achieve a quantization to 4 bit integer without any 
performance degradation of the model. Performance is even slightly increased 
Fig. 7. Resulting model performances, when applying ECQ vs. ECQ* 4 bit quantization
on MLP_GSC (left) and VGG16 (right). Each point corresponds to a model
rendered with a specific $\lambda$ which is a regulator for the entropy constraint and thus
incrementally enhances sparsity. Abbreviations in the legend labels refer to bit width
(bw) and target sparsity (p), which is defined in Sect. 4.2.

Fig. 8. Resulting model performances, when applying ECQ vs. ECQ* 4 bit quantization
on VGG16, VGG16 with BatchNorm (BN) modules (left) and ResNet18 (right).


due to quantization when compared to the unquantized baseline. In the regime 
of high sparsity, model accuracy of the previous state-of-the-art (ECQ) drops 
significantly faster compared to the LRP-adjusted quantization scheme. 

Regarding the handling of BatchNorm modules for LRP, the literature proposes
merging the BatchNorm layer parameters with the preceding linear layer [15]
into a single linear transformation. This canonization process is sensible
because it reduces the number of computational steps in the backward pass
while maintaining functional equivalence between the original and the canonized
model in the forward pass. It has further been shown that network canonization
can increase explanation quality [15]. However, with the aim of computing
weight relevance scores for a BatchNorm layer's adjacent linear layer in its
original (trainable) state, keeping the layers separate is more favorable than
merging. Therefore, the $\alpha\beta$-rule with $\beta = 1$ is also applied to
BatchNorm layers. The quantization results of the VGG architecture with
BatchNorm modules and ResNet18 are shown in Fig. 8.
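
For reference, the merging step that is discussed (and deliberately not used)
above can be sketched as follows. This is a minimal illustration in plain
Python, assuming a dense layer followed by BatchNorm in inference mode with
fixed running statistics; the function names `fold_batchnorm`, `linear` and
`batchnorm` are hypothetical and not taken from the chapter or from Zennit.

```python
import math


def linear(x, weight, bias):
    """Dense layer: weight holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def batchnorm(y, gamma, shift, mean, var, eps=1e-5):
    """BatchNorm in inference mode, applied per output unit."""
    return [g * (yi - mu) / math.sqrt(v + eps) + s
            for yi, g, s, mu, v in zip(y, gamma, shift, mean, var)]


def fold_batchnorm(weight, bias, gamma, shift, mean, var, eps=1e-5):
    """Merge BatchNorm into the preceding dense layer (canonization).

    Returns a new (weight, bias) pair such that
    linear(x, new_weight, new_bias) == batchnorm(linear(x, weight, bias), ...).
    """
    folded_w, folded_b = [], []
    for row, b, g, s, mu, v in zip(weight, bias, gamma, shift, mean, var):
        scale = g / math.sqrt(v + eps)        # per-unit rescaling factor
        folded_w.append([scale * w for w in row])
        folded_b.append(scale * (b - mu) + s)
    return folded_w, folded_b
```

The folded layer reproduces the forward pass exactly, but its weights are no
longer the trainable weights of the original linear layer, which is why the
layers are kept separate when weight relevances are needed.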

In order to capture the computational overhead of LRP in terms of additional
training time, we compared the average training times per epoch of the
different model architectures. Relevance-dependent quantization (ECQ*)
requires approximately 1.2x, 2.4x, and 3.2x more processing time than baseline
quantization (ECQ) for the MLP_GSC, VGG16, and ResNet18 architectures,
respectively. This extra effort can be explained by the additional
forward-backward passes performed in Zennit for LRP computation. More
concretely, Zennit, used as a plug-in XAI module, computes one additional
forward pass layer-wise and redistributes the relevances to the preceding
layers according to the decomposition and aggregation rules specified in
Sect. 4.1. For redistribution, Zennit computes one additional backward pass for
layers associated with the $\varepsilon$-rule and two additional backward
passes for layers associated with the $\alpha\beta$-rule in order to derive
positive ($\alpha$) and negative ($\beta$) relevance contributions. To recap,
in the applied composite strategy, the $\varepsilon$-rule is used for dense
layers and the $\alpha\beta$-rule for convolutional layers and BatchNorm
parameters, which results in the extra computational cost for VGG16 and
ResNet18 compared to MLP_GSC, which consists solely of dense layers. In
addition, the aggregation of relevances that is needed for convolutional
filters is not required for dense layers. Note that the above-mentioned values
for the additional computational overhead of ECQ* due to relevance computation
can be interpreted as an upper bound and that there are options to minimize
the effort, e.g., by 1) not considering relevances for cluster assignments in
each training iteration, or 2) leveraging pre-computed outputs or even
gradients from the quantized base model instead of separately computing
forward-backward passes with a model copy in the Zennit module. Whereas 1)
corresponds to a change in the quantization setup, 2) requires parallelization
optimizations of the software framework.


Bit Width Variation. Bit width reduction has multiple benefits over full
precision in terms of memory, latency, power consumption, and chip area
efficiency. For instance, a reduction from standard 32 bit precision to 8 bit
or 4 bit directly leads to a memory reduction of almost 4x and 8x,
respectively. Arithmetic with lower bit width is also considerably faster if
the hardware supports it. For example, since the release of NVIDIA's Turing
architecture, 4 bit integer is supported, which increases the throughput of the
RTX 6000 GPU to 522 TOPS (tera operations per second), compared to 261 TOPS
for 8 bit integer or 14.2 TFLOPS for 32 bit floating point [31]. Furthermore,
Horowitz showed that, for a 45 nm technology, low-precision logic is
significantly more efficient in terms of energy and area [22]. For example,
performing 8 bit integer addition and multiplication is 30x and 19x more
energy efficient compared to 32 bit floating point addition and
multiplication. The respective chip area efficiency is increased by 116x and
27x as compared to 32 bit float. It is also shown that memory reads and writes
have the highest energy cost, especially when reading data from external DRAM.
This further motivates bit width reduction because it can reduce the number of
overall RAM accesses, since more data fits into the same caches/registers at
reduced precision.

Fig. 9. Resulting MLP_GSC model performances vs. memory footprint when applying
ECQ* with 2 bit to 5 bit quantization.

Fig. 10. Resulting VGG16 model performances vs. memory footprint when applying
ECQ* with 2 bit to 5 bit quantization.
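
The memory argument can be made concrete with a small sketch. The helpers
below are illustrative only (hypothetical names, plain Python): they compute
the ideal reduction factor relative to 32 bit and pack signed 4 bit integers
two per byte. Real deployments add overhead such as codebooks or scaling
factors, which is consistent with the "almost" 4x and 8x stated above.

```python
def reduction_factor(bit_width, reference_bits=32):
    """Ideal memory reduction when going from reference_bits to bit_width."""
    return reference_bits / bit_width


def pack_int4(values):
    """Pack signed 4 bit integers (range -8..7) two per byte."""
    values = list(values)
    if len(values) % 2:
        values.append(0)                      # pad to an even count
    packed = bytearray()
    for low, high in zip(values[0::2], values[1::2]):
        # two's complement nibbles: low nibble first, high nibble second
        packed.append((low & 0xF) | ((high & 0xF) << 4))
    return bytes(packed)


def unpack_int4(data, count):
    """Inverse of pack_int4; count is the number of original values."""
    values = []
    for byte in data:
        for nibble in (byte & 0xF, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]
```

Eight weights stored this way occupy 4 bytes instead of the 32 bytes needed
for 32 bit floating point values.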

In order to investigate different bit widths in the regime of ultra low
precision, we compare the compressibility and model performances of the MLP_GSC
and VGG16 networks when quantized to 2 bit, 3 bit, 4 bit and 5 bit integer
values (see Figs. 9 and 10). Here, we directly encoded the integer tensors with
the DeepCABAC codec of the ISO/IEC MPEG NNR standard [24]. The least sparse
working points of each trial, i.e., the rightmost data points of each line,
show the expected behaviour, namely that compressibility is increased by
continuously reducing the bit width from 5 bit to 2 bit. However, this effect
decreases or even reverses when the bit width is in the range of 3 bit to
5 bit. In other words, reducing the number of centroids from $2^5 = 32$ to
$2^3 = 8$ does not necessarily lead to a further significant reduction in the
resulting bitstream size if sparsity is predominant. The 2 bit quantization
still minimizes the size of the bitstream, even if, especially for the VGG
model, more accuracy is sacrificed for this purpose. Note that compressibility
is only one reason for reducing bit width besides, for example, speeding up
model inference due to increased throughput.


ECQ* Results Overview. In addition to the performance graphs in the previous
subsections, all quantization results are summarized in Table 1. Here, ECQ*
and ECQ are compared specifically for a 2 and 4 bit quantization, as these fit
particularly well to power-of-two hardware registers. The ECQ* 4 bit
quantization achieves a compression ratio of 103x for VGG16 with a negligible
drop in accuracy of -0.1%. In comparison, ECQ achieves the same compression
ratio only with a model degradation of -1.23% top-1 accuracy. For the 4 bit
quantization of MLP_GSC, ECQ* achieves its highest accuracy (a "drop" that is
in fact an increase of +0.71% compared to the unquantized baseline model) with
a compression ratio that is almost 10% larger than the one at the highest
achievable accuracy of ECQ (+0.47%). For sparsities beyond 70%, ECQ
significantly reduces the model's predictive performance, e.g., at a sparsity
of 80.39% ECQ shows a loss of -1.40% whereas ECQ* only degrades by -0.34%.
ResNet18 sacrifices performance at each quantization setting, but especially
for ECQ* the accuracy loss is negligible. The 2 bit representations of
ResNet18 sacrifice more than 5% top-1 accuracy compared to the unquantized
model, which may be compensated with more than 20 epochs of quantization-aware
training, but is also due to the higher complexity of the Pascal VOC task.

Finally, the 2 bit results in Table 1 show two major findings: 1) With only
a minor model degradation, all weight layers of the MLP_GSC and VGG networks
can also be quantized to only 4 discrete centroid values while still
maintaining a high level of sparsity, and 2) ECQ* renders more compressible
models than ECQ, as indicated by the higher compression ratios (CR).
Table 1. Quantization results for ECQ* for 2 bit and 4 bit quantization: highest
accuracy, highest compression gain without model degradation (if possible) and highest
compression gain with negligible degradation. Underlined values mark the best results
in terms of performance and compressibility with negligible drop in top-1 accuracy.

| Dataset                | Model    | Prec.^a | Method^b | Acc. (%) | Acc. drop | Sparsity (%)^c | Size (kB) | CR^d   |
|------------------------|----------|---------|----------|----------|-----------|----------------|-----------|--------|
| CIFAR-10               | VGG16    | W4A16   | ECQ*     | 92.27    | +1.55     | 41.39          | 4,446.39  | 13.48  |
| CIFAR-10               | VGG16    | W4A16   | ECQ*     | 90.86    | +0.14     | 91.95          | 933.99    | 64.17  |
| CIFAR-10               | VGG16    | W4A16   | ECQ*     | 90.62    | -0.10     | 94.67          | 584.16    | 102.59 |
| CIFAR-10               | VGG16    | W4A16   | ECQ      | 92.09    | +1.37     | 29.88          | 4,658.01  | 12.87  |
| CIFAR-10               | VGG16    | W4A16   | ECQ      | 91.03    | +0.31     | 88.03          | 1,246.27  | 48.09  |
| CIFAR-10               | VGG16    | W4A16   | ECQ      | 89.49    | -1.23     | 93.97          | 585.40    | 102.37 |
| CIFAR-10               | VGG16    | W2A16   | ECQ*     | 90.42    | -0.30     | 83.23          | 1,394.52  | 42.98  |
| CIFAR-10               | VGG16    | W2A16   | ECQ      | 90.19    | -0.53     | 81.58          | 1,486.76  | 40.31  |
| Google Speech Commands | MLP_GSC  | W4A16   | ECQ*     | 88.95    | +0.71     | 65.14          | 128.03    | 20.05  |
| Google Speech Commands | MLP_GSC  | W4A16   | ECQ*     | 88.34    | +0.10     | 78.77          | 92.46     | 27.77  |
| Google Speech Commands | MLP_GSC  | W4A16   | ECQ*     | 87.89    | -0.34     | 80.45          | 87.52     | 29.33  |
| Google Speech Commands | MLP_GSC  | W4A16   | ECQ      | 88.71    | +0.47     | 59.95          | 139.96    | 18.34  |
| Google Speech Commands | MLP_GSC  | W4A16   | ECQ      | 88.32    | +0.08     | 70.74          | 98.32     | 26.11  |
| Google Speech Commands | MLP_GSC  | W4A16   | ECQ      | 86.84    | -1.40     | 80.39          | 69.67     | 36.85  |
| Google Speech Commands | MLP_GSC  | W2A16   | ECQ*     | 87.46    | -0.78     | 83.97          | 68.77     | 37.33  |
| Google Speech Commands | MLP_GSC  | W2A16   | ECQ      | 87.72    | -0.52     | 77.55          | 78.54     | 32.69  |
| Pascal VOC             | ResNet18 | W4A16   | ECQ*     | 73.13    | -0.27     | 32.82          | 3,797.97  | 11.79  |
| Pascal VOC             | ResNet18 | W4A16   | ECQ*     | 72.78    | -0.62     | 68.67          | 2,246.71  | 19.93  |
| Pascal VOC             | ResNet18 | W4A16   | ECQ*     | 72.48    | -0.92     | 74.65          | 1,946.22  | 23.01  |
| Pascal VOC             | ResNet18 | W4A16   | ECQ      | 72.95    | -0.45     | 24.63          | 3,882.62  | 11.53  |
| Pascal VOC             | ResNet18 | W4A16   | ECQ      | 72.56    | -0.84     | 61.12          | 2,480.59  | 18.05  |
| Pascal VOC             | ResNet18 | W4A16   | ECQ      | 71.74    | -1.66     | 74.88          | 1,841.82  | 24.32  |

^a WxAy indicates a quantization of weights and activations to x and y bit.
^b ECQ refers to ECQ* w/o LRP constraint.
^c Sparsity, measured as the percentage of zero-valued parameters in the DNN.
^d Compression ratio (full-precision size/compressed size) when applying the
DeepCABAC codec of the ISO/IEC MPEG NNR standard [24].
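
The two derived quantities in Table 1 follow directly from their footnote
definitions. The sketch below (hypothetical helper names, plain Python) spells
them out and shows a simple consistency check: multiplying the compressed size
by the compression ratio should give roughly the same full-precision size for
every row of one model.

```python
def sparsity_percent(weights):
    """Percentage of zero-valued parameters in a flat list of weights."""
    zeros = sum(1 for w in weights if w == 0)
    return 100.0 * zeros / len(weights)


def compression_ratio(full_precision_kb, compressed_kb):
    """Full-precision size divided by compressed size."""
    return full_precision_kb / compressed_kb


def implied_full_precision_kb(compressed_kb, ratio):
    """Full-precision size implied by one table row (size times CR)."""
    return compressed_kb * ratio


# Two VGG16 rows from Table 1 (sizes in kB, rounded as printed there).
least_sparse = implied_full_precision_kb(4446.39, 13.48)
most_sparse = implied_full_precision_kb(584.16, 102.59)
relative_gap = abs(least_sparse - most_sparse) / least_sparse
print(f"implied sizes agree within {100 * relative_gap:.3f}%")
```

Small deviations between rows are expected because the table values are
rounded to two decimals.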


6 Conclusion 


In this chapter we presented a new entropy-constrained neural network
quantization method (ECQ*) that utilizes weight relevance information from
Layer-wise Relevance Propagation (LRP). Our novel method thus combines
concepts of explainable AI (XAI) and information theory. In particular, instead
of only assigning weight values based on their distances to the respective
quantization clusters, the assignment function additionally considers weight
relevances based on LRP. In detail, each weight's contribution to inference in
interaction with the transformed data, as well as the cluster information
content, is calculated and applied. For this approach, we first utilized the
observation that a weight's magnitude does not necessarily correlate with its
importance or relevance for a model's inference capability. Next, we verified
this observation in a relevance vs. weight (magnitude) correlation analysis and
subsequently introduced our ECQ* method. As a result, smaller weight parameters
that are usually omitted in a classical quantization process are preserved if
their relevance score indicates a stronger contribution to the overall neural
network accuracy or performance.

The experimental results show that this novel ECQ* method generates low
bit width (2-5 bit) and sparse neural networks while maintaining or even
improving model performance. In particular, the 2 and 4 bit variants are
therefore highly suitable for neural network hardware adaptation tasks. Due to
the reduced parameter precision and the high number of zero elements, the
rendered networks are also highly compressible in terms of file size, e.g., up
to 103x compared to the full-precision unquantized DNN model, without degrading
the model performance. Our ECQ* approach was evaluated on different types of
models and datasets (including Google Speech Commands, CIFAR-10 and Pascal
VOC). The comparative results vs. state-of-the-art entropy-constrained-only
quantization (ECQ) show a performance increase in terms of higher sparsity as
well as higher compression. Finally, hyperparameter optimization and bit width
variation results were also presented, from which the optimal parameter
selection for ECQ* was derived.
Abstract. Explainable machine learning and uncertainty quantification 
have emerged as promising approaches to check the suitability and under- 
stand the decision process of a data-driven model, to learn new insights 
from data, but also to get more information about the quality of a specific 
observation. In particular, heatmapping techniques that indicate the sen- 
sitivity of image regions are routinely used in image analysis and interpre- 
tation. In this paper, we consider a landmark-based approach to generate 
heatmaps that help derive sensitivity and uncertainty information for an 
application in marine science to support the monitoring of whales. Single 
whale identification is important to monitor the migration of whales, to 
avoid double counting of individuals and to reach more accurate popu- 
lation estimates. Here, we specifically explore the use of fluke landmarks 
learned as attention maps for local feature extraction and without other 
supervision than the whale IDs. These individual fluke landmarks are 
then used jointly to predict the whale ID. With this model, we use sev- 
eral techniques to estimate the sensitivity and uncertainty as a function 
of the consensus level and stability of localisation among the landmarks. 
For our experiments, we use images of humpback whale flukes provided 
by the Kaggle Challenge “Humpback Whale Identification” and compare 
our results to those of a whale expert. 


Keywords: Attention maps · Sensitivity · Uncertainty · Whale identification
1 Introduction 


For many scientific disciplines, reliability and trust in a machine learning result 
are of great importance, in addition to the prediction itself. Two key values that 
can contribute significantly to this are the interpretability and the estimation of 
uncertainty: 


— An interpretation aims at the presentation of properties of a machine learn- 
ing model (e.g., a decision process of a neural network) in a way that it is 
understandable to a human [21]. One possibility to obtain an interpretation is 
sensitivity analysis, which provides information about how the model's output
is affected by small or specifically chosen changes in the input [18].

— Uncertainty is the quantity of all possible changes in the output that result 
from uncertainties already included in the data (aleatoric/data uncertainty) 
or a lack of knowledge of the machine learning model (epistemic/model
uncertainty) [6].


Both uncertainty quantification and sensitivity analysis have become a broad 
field of research in recent years, especially for developing methods to check the 
suitability and to better understand the decision-making process of a data-driven 
model [6,21,24]. However, so far, the two areas have usually been considered 
separately, although a joint consideration has clear benefits, since the analysis 
of sensitivity can often be considered as a part or first step towards uncertainty 
quantification. 

In this chapter, we will consider a use case from marine science to demon- 
strate the usefulness of a joint use of sensitivity and uncertainty quantification 
in landmark-based identification. In particular, we look at the identification of 
whales by means of images of their fluke. Whale populations worldwide are 
threatened by commercial whaling, global warming, and the struggle for food in 
competition with the fishing industry [33]. The protection of whales relies
substantially on reconstructing their spatio-temporal migration, which in turn
is based on the (re)identification of individuals. Individual whales can be
identified by the shape of their flukes and their unique pigmentation [13].
Three features in particular play a crucial role for whale experts in
distinguishing between individual whales (see Fig. 1): 


— Pigmentation-based features. These features correspond to coloured patches 
on the fluke, forming unique patterns. They are very clearly visible to the 
human eye. They can change significantly within the first few years of whale 
life and in extremely cold water (for example, Antarctica, but also Greenland 
and the North Atlantic). They may be partially obscured by heavy diatom 
growth, characterized by a yellow-orange appearance of the fluke. 

— Fluke shape. This feature is reliable and robust. The outer 20% of the tail may 
become more distorted and change over time, but the inner 80% and V-notch 
are reliable and stable. Although it is difficult to detect by the human eye, it 
has proven to be very useful for machine learning-based approaches [14,15,25].

Fig. 1. Important characteristics of a whale fluke.


— Scars. The surface of the fluke usually shows contrasting scars. However, 
the contrast can vary greatly and the scars may change over time. Certain 
scars grow with the whale, such as killer whale rake marks that form parallel 
lines or barnacle marks that form circles. In addition, lighting conditions can 
significantly affect the detectability of scars. 


For whale monitoring, whale researchers often use geo-tagged photos with time
and location information to reconstruct activities. Since manual analysis is
too costly and a huge amount of data has therefore remained unused, current
approaches focus on machine learning [14,15,25].

Despite the accuracy observed in recent competitions [29], limited effort
has been devoted to actually quantifying sensitivity in the prediction and
identifying sources of uncertainty. We argue that uncertainty identification
remains a central topic requiring attention and propose a methodology based on
landmarks
and their spatial sensitivity and uncertainty to answer a number of scientific 
questions useful for experts in animal conservation. Specifically, we tackle the 
following questions: 


— Which parts of the fluke are more consistently useful to identify whales? A 
whale fluke changes with time, so characteristic features of a fluke may no
longer be present and hence may not be visualized in the results of the
interpretation tool.

— Can landmarks together with uncertainty and sensitivity indicate the suit- 
ability of images for identification? Suitability is influenced, for example, by 
image quality, position, and size of the object, but also by the presence of 
relevant features. 


These goals are formulated from the perspective of whale research, but are
also intended to raise relevant questions from the perspective of machine
learning, such as the usefulness of interpretation tools to improve models. In
general, the task of re-identifying objects or living beings from images is a
common topic [2,16,26], and the approach and insights presented in this paper
can also be applied to similar tasks from other fields.
2 Related Work 


Self-explainable Deep Learning Models. Although the vast majority of 
methods to improve the interpretability and explainability of deep learning mod- 
els are designed to work post-hoc [19,28,32], i.e. the important parts of the input 
are highlighted while the model itself remains unmodified, a few approaches aim 
at modifying the model so that its inherent interpretability is enhanced, also 
referred to as self-explainable models [23]. This has the advantage that the inter- 
pretation is actually part of the inference process, rather than being computed a 
posteriori by an auxiliary interpretation method, resolving potential trustworthi- 
ness issues of post-hoc methods [22]. The visual interpretation can be obtained, 
for example, by incorporating a global average pooling after the last
convolutional layer of the model [39] or by leveraging a spatial attention
mechanism [36].
Our self-explainable method is inspired by [36] and [38], and learns a fixed set of 
landmarks, along with their associated attention maps, in a weakly supervised 
setting by only using class labels. To gain further insight, the landmarks can be 
used for sensitivity analysis and uncertainty quantification. 


Uncertainty Quantification. The field of uncertainty quantification has 
gained new popularity in recent years, especially for determining the uncertainty 
of complex models such as neural networks. In most applications, the predictive
uncertainty is of interest, i.e., the uncertainty that affects the estimation
from various sources of uncertainty, originating from the data itself
(aleatoric uncertainty) and arising from the model (epistemic or model
uncertainty). These sources are often not negligible, especially in real-world
applications, and must be determined for a comprehensive statement about the
reliability and accuracy of the result. Several works, such as [5,30], explore
Monte Carlo dropout or quantify uncertainty by analysing the softmax output of
neural networks. Comprehensive overviews of the field are given in [7,12,34],
while [6] specifically focuses on the applicability in real-world scenarios.


Sensitivity Analysis. This kind of analysis is usually considered in the context 
of explainable machine learning. Here, a set of input variables, such as pixel values 
in an image region or a unit in some of the model’s intermediate representa- 
tions [3,31], are perturbed, and the effect of such changes on the result is consid- 
ered. This approach helps to understand the decision process and causes of uncer- 
tainties, and to gain insights into salient features that can be spatial, temporal 
or spectral. According to [21], sensitivity analysis approaches belong to inter- 
pretation tools, as they transform complex aspects such as model behavior into 
concepts understandable by a human [19,24]. Many approaches use heatmaps 
that visualize the sensitivity of the output to perturbations of the input, the 
attention map of the classifier model, or the importance of the features [11]. 
These tools are extremely helpful and have been used recently to infer new sci- 
entific knowledge and discoveries and to improve the model [21,27,31]. Probably 
the best known principle is the study of the effects of masking selected regions of 
the input, which is systematically applied in occlusion sensitivity maps [20]. For 
more details, including specific types of interpretation and further implementa- 
tion, we refer to recent studies [1,8,9]. 
Sensitivity vs. Uncertainty. There are significant differences between the 
analysis of uncertainties and sensitivity, and previous applications mostly con- 
sider only one of the two. Sensitivity analysis focuses more on the input and 
the effect of modifications on the predictions, while uncertainty quantification 
focuses on the propagation of uncertainties in the model. Nevertheless, there 
are also strong correlations, as shown in [18]. Sensitivity analysis, for example, 
explores the causes and importance of specific uncertainties in the input data for 
the decision, while uncertainty analysis describes the whole set of possible out- 
comes. Both consider variations in the input and their influence on the output 
to derive statements for decision-making. Our work is based on the preliminary 
work of [14], in which occlusion sensitivity maps are created by systematically 
covering individual areas in images of whale flukes in order to identify the char- 
acteristic features of flukes for whale identification. Here, we propose to learn 
a set of compact attention maps such that each specializes in the detection of 
a fluke landmark. These learned landmarks are used to extend [14] by a
combined analysis of the sensitivity of the classification to each landmark and their
uncertainty. 


3 Humpback Whale Data 


3.1 Image Data 


In this work, we use a set of humpback whale images from the Kaggle Challenge
“Humpback Whale Identification”. More specifically, we process their tails,
called flukes (see Fig. 1). The data set consists of more than 67,000 images,
in which 10,008 different whale individuals, i.e., 10,008 different classes,
are represented. We pruned the data set and used only the 1,646 classes that
contained three or more images in the training set of the challenge. For our
experiments, we restrict ourselves to images in the training set because the
test set does not provide reference information, as is generally the case for
Kaggle challenges. We split the images into a training set
$\mathcal{X}_{\text{train}} = \{x_1, \ldots, x_N\}$ (9,408 images) and a test
set $\mathcal{X}_{\text{test}} = \{x_1, \ldots, x_T\}$ (1,646 images, or one
per class, i.e., per specific whale individual). The number of images per set
is given by $N$ and $T$, respectively. The set
$\mathcal{X}_c = \{x_1, \ldots, x_R\}$ describes a subset that includes $R$
images for one specific class $c$.
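
The pruning and splitting described above can be sketched as follows. This is
only an illustration under stated assumptions: the chapter does not say how
the single test image per class was selected, so the random choice, the seed
and the name `split_by_class` are hypothetical.

```python
import random
from collections import defaultdict


def split_by_class(labels, min_images=3, seed=0):
    """Prune rare classes and hold out one image per remaining class.

    labels: dict mapping image id -> whale id.
    Returns (train_ids, test_ids).
    """
    by_class = defaultdict(list)
    for image_id, whale_id in labels.items():
        by_class[whale_id].append(image_id)

    rng = random.Random(seed)
    train_ids, test_ids = [], []
    for whale_id in sorted(by_class):
        images = sorted(by_class[whale_id])
        if len(images) < min_images:
            continue                          # class is pruned entirely
        held_out = rng.choice(images)         # assumption: random hold-out
        test_ids.append(held_out)
        train_ids.extend(i for i in images if i != held_out)
    return train_ids, test_ids
```

With this scheme the test set contains exactly one image per retained class,
which matches the 1,646 test images for 1,646 classes reported above.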


3.2 Expert Annotations 


A domain expert participated in the study and provided human annotations of
remarkable features that help to discriminate between whale individuals. For
each annotation, the expert was provided with a pair of images and asked to
mark a set of features that help to decide whether the images show the same
individual or not. Three features are generally used by the expert (personal
communication), who therefore provided three features per image analysed.
Some examples are shown in Fig. 5a.
4 Methods 


4.1 Landmark-Based Identification Framework 


Fig. 2. Given the image of a fluke, we extract the feature tensor Z using a CNN. A set 
of compact attention maps A, excluding a background map, is then used to extract 
localized features from Z. These features are then averaged and used for classification 
into C classes, each corresponding to an individual whale. 


We propose to learn a set of discriminant landmarks for whale identification such 
that the model uses evidence from each one separately in order to solve the task. 
The rationale behind this approach is twofold: 


1. Each landmark will gather evidence from a different region of the image, 
effectively resulting in an ensemble of diverse classifiers, each using a different 
subset of the data. This independence between the different classifiers provides 
an improved uncertainty estimation. 

2. Since landmarks are trained to attend to a small region of the image, it 
becomes very easy to visualize where the evidence is coming from with no 
further computation, thus inherently providing an enhanced level of inter- 
pretability. 


In order to learn to detect informative landmarks without further supervi- 
sion than the whale ID, we use an approach inspired by [38]. Likewise, we aim at 
learning to detect a fixed set of keypoints in the image to establish at which loca- 
tions landmarks are to be extracted. Unlike [38], we do not use an hourglass-type 
architecture, but a standard classification CNN with a reduced downsampling 
rate in order to allow for a better spatial resolution. Another major difference 
is that we do not use any reconstruction loss and therefore need no decoding 
elements. 
Given an image $X \in \mathbb{R}^{3 \times MD \times ND}$ and a CNN with a
downsampling factor $D$, the $H$-channel tensor resulting from applying the
CNN to $X$ is:

$$Z = \mathrm{CNN}(X; \theta) \in \mathbb{R}^{H \times M \times N}. \tag{1}$$

We obtain the $K + 1$ attention maps, representing the $K$ keypoints and
the background, by applying a linear layer to each location of $Z$, which is
equivalent to a $1 \times 1$ convolutional filter parametrized by the weight
matrix $W_{\text{attn}} \in \mathbb{R}^{H \times (K+1)}$, followed by a
channel-wise softmax:

$$A = \mathrm{softmax}(Z * W_{\text{attn}}) \in \mathbb{R}^{(K+1) \times M \times N}. \tag{2}$$

Each attention map $A_k$, except for the $(K+1)^{\text{th}}$, which captures
the background, is applied to the tensor $Z$ in order to obtain the
corresponding landmark vector:

$$l_k = \sum_{u=1}^{M} \sum_{v=1}^{N} A_k(u,v)\, Z(u,v) \in \mathbb{R}^{H}. \tag{3}$$

Each landmark $l_k$ undergoes a linear operation in order to generate the $C$
classification scores associated with it, where $C$ is the total number of
classes:

$$y_k = l_k W_{\text{class}} \in \mathbb{R}^{C}. \tag{4}$$
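
The chain from the feature tensor to the per-landmark class scores in
Eqs. (2)-(4) can be sketched in plain Python. This is a toy illustration on
nested lists, assuming tiny tensor sizes; the function names are hypothetical,
and a real implementation would use a deep learning framework instead.

```python
import math


def attention_maps(Z, W_attn):
    """Channel-wise softmax over a 1x1 convolution of Z.

    Z: nested list of shape H x M x N; W_attn: H x (K+1).
    Returns attention maps of shape (K+1) x M x N.
    """
    H, M, N = len(Z), len(Z[0]), len(Z[0][0])
    K1 = len(W_attn[0])
    A = [[[0.0] * N for _ in range(M)] for _ in range(K1)]
    for u in range(M):
        for v in range(N):
            logits = [sum(Z[h][u][v] * W_attn[h][k] for h in range(H))
                      for k in range(K1)]
            top = max(logits)                 # subtract max for stability
            exps = [math.exp(x - top) for x in logits]
            total = sum(exps)
            for k in range(K1):
                A[k][u][v] = exps[k] / total
    return A


def landmark_vectors(Z, A):
    """Attention-weighted sum of Z per landmark; the last map (background)
    is skipped. Returns K vectors of length H."""
    H, M, N = len(Z), len(Z[0]), len(Z[0][0])
    return [[sum(A[k][u][v] * Z[h][u][v]
                 for u in range(M) for v in range(N))
             for h in range(H)]
            for k in range(len(A) - 1)]


def class_scores(l_k, W_class):
    """Linear map from a landmark vector (length H) to C class scores."""
    C = len(W_class[0])
    return [sum(l_k[h] * W_class[h][c] for h in range(len(l_k)))
            for c in range(C)]
```

Because the softmax runs over the channel axis, the $K + 1$ attention values
at every spatial location sum to one, so each location is softly assigned to
one landmark or to the background.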


We apply different losses to the classification scores $y$, the landmark
feature vectors $l$ and the attention maps $A$. For the classification scores,
we use a cross-entropy loss, providing the only gradients for learning the
weights of the linear operator $W_{\text{class}} \in \mathbb{R}^{H \times C}$:

$$L_{\text{class}}(y, c) = -\log\left(\frac{\exp(y_c)}{\sum_{i=1}^{C} \exp(y_i)}\right). \tag{5}$$

In addition, we make sure that landmark vectors are similar across images of
the same individual. We use a triplet loss for each landmark $k$, which is
computed on the landmark vector $l_k^a$, used as anchor in the triplet loss, a
positive vector from the corresponding landmark stemming from an image of the
same class, $l_k^p$, and a negative one from a different class, $l_k^n$:

$$L_{\text{triplet}}(l_k^a, l_k^p, l_k^n) = \max\left(\|l_k^a - l_k^p\|_2 - \|l_k^a - l_k^n\|_2 + 1,\; 0\right). \tag{6}$$
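
A minimal sketch of the two losses on scores and landmark vectors, assuming
plain Python lists and a margin of 1 as in Eq. (6); the function names are
illustrative and not taken from the authors' code.

```python
import math


def cross_entropy(scores, target):
    """Softmax cross-entropy for one vector of class scores."""
    top = max(scores)                         # log-sum-exp with max shift
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum - scores[target]


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative distance."""
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(gap + margin, 0.0)
```

The triplet term is zero as soon as the negative is at least one margin
further from the anchor than the positive, so well-separated triplets stop
contributing gradients.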


Regarding the losses applied to the landmark attention maps, which have
the role of ensuring learning a good set of keypoints for landmark extraction,
we apply two losses:

$$L_{\text{conc}}(A) = \frac{1}{K} \sum_{k=1}^{K} \left( \sigma_u^2(A_k) + \sigma_v^2(A_k) \right), \tag{7}$$

which aims at encouraging each attention map to be concentrated around its
center of mass by minimizing the variances of each attention map,
$\sigma_u^2(A_k)$ and $\sigma_v^2(A_k)$, across both spatial dimensions and


$$L_{\max}(A) = 1 - \frac{1}{K} \sum_{k=1}^{K} \max_{u,v} A_k(u,v), \tag{8}$$


which ensures that all landmarks are present in each image. 
These four losses are combined as a weighted sum to obtain the final loss: 


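$$L = \lambda_{\text{class}} L_{\text{class}} + \lambda_{\text{triplet}} L_{\text{triplet}} + \lambda_{\text{conc}} L_{\text{conc}} + \lambda_{\max} L_{\max}, \tag{9}$$

where $\lambda_{\text{class}}$, $\lambda_{\text{triplet}}$,
$\lambda_{\text{conc}}$ and $\lambda_{\max}$ are scalar hyperparameters.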
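
The concentration term and the weighted combination can be illustrated with a
short sketch. It assumes that each attention map is normalized to a spatial
distribution before its variances are computed, which the text does not state
explicitly; the helper names are hypothetical.

```python
def spatial_variances(A_k):
    """Variances of a 2D attention map along u (rows) and v (columns).

    Assumption: the map is normalized by its total mass so that it can be
    treated as a distribution over pixel coordinates.
    """
    M, N = len(A_k), len(A_k[0])
    total = sum(sum(row) for row in A_k)
    cells = [(u, v, A_k[u][v]) for u in range(M) for v in range(N)]
    mean_u = sum(u * a for u, _, a in cells) / total
    mean_v = sum(v * a for _, v, a in cells) / total
    var_u = sum((u - mean_u) ** 2 * a for u, _, a in cells) / total
    var_v = sum((v - mean_v) ** 2 * a for _, v, a in cells) / total
    return var_u, var_v


def concentration_loss(maps):
    """Mean over the K landmark maps of var_u + var_v (background excluded
    by the caller)."""
    return sum(sum(spatial_variances(A_k)) for A_k in maps) / len(maps)


def weighted_total(terms, weights):
    """Weighted sum of named loss terms, e.g. class, triplet, conc, max."""
    return sum(weights[name] * value for name, value in terms.items())
```

A map with all of its mass on a single pixel has zero variance in both
directions, so it contributes nothing to the concentration term.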


4.2 Uncertainty and Sensitivity Analysis 


Patch-Based Occlusion Sensitivity Maps. Determining occlusion sensitiv- 
ity maps is a strategy developed by [37] to evaluate the sensitivity of a trained 
model to partial occlusions in an input image. The maps visualize which regions 
contribute positively and which contribute negatively to the result. The approach 
is to systematically mask different regions for a given input image, choosing a 
rectangular patch in our case. Two parameters, namely patch size p and step 
size, are chosen by the user, and the choice affects the result in terms of preci- 
sion and smoothness. In the area around position u occluded by the patch, the 
pixel-wise results of the classifier for each class are compared with the results 
obtained after part of the image was occluded. For the expected class $c$, the
score $s_{c,u}$ is predicted for the corresponding position $u$ of the patch.
The difference $S_{c,u}$ is given by

$$S_{c,u} = s_c - s_{c,u}, \tag{10}$$

where the original predicted score for each class is denoted by $s_c$ and the
predicted score based on occlusion is given by $s_{c,u}$. Performing this for
the entire image yields a heat map of occlusion sensitivity.
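
A patch-based occlusion sensitivity map can be sketched as below. This is a
simplified, single-channel illustration: `score_fn` stands for any trained
classifier returning a list of class scores, the fill value of 0 and the
averaging of overlapping patches are assumptions, and the function name is
hypothetical.

```python
def occlusion_sensitivity(image, score_fn, target_class, patch=2, step=2,
                          fill=0.0):
    """Heat map of score drops when square patches of the image are masked.

    image: 2D list of pixel values; score_fn(image) -> list of class scores.
    Each pixel receives the mean drop over all patches that covered it.
    """
    rows, cols = len(image), len(image[0])
    base = score_fn(image)[target_class]      # unoccluded score s_c
    heat = [[0.0] * cols for _ in range(rows)]
    hits = [[0] * cols for _ in range(rows)]
    for top in range(0, rows, step):
        for left in range(0, cols, step):
            occluded = [row[:] for row in image]
            covered = [(r, c)
                       for r in range(top, min(top + patch, rows))
                       for c in range(left, min(left + patch, cols))]
            for r, c in covered:
                occluded[r][c] = fill
            drop = base - score_fn(occluded)[target_class]
            for r, c in covered:
                heat[r][c] += drop
                hits[r][c] += 1
    # pixels never covered (step > patch) keep a sensitivity of 0
    return [[heat[r][c] / hits[r][c] if hits[r][c] else 0.0
             for c in range(cols)] for r in range(rows)]
```

Positive values mark regions whose occlusion lowers the score of the expected
class, i.e., regions that support the prediction; negative values mark regions
that count against it.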


Landmark-Based Sensitivity Analysis. Similarly to the patch-based occlu- 
sion sensitivity maps presented previously, landmark-based sensitivity analysis 
eliminates individual landmarks, by setting all the elements in the corresponding 
feature vector $l_k$ to zero, in order to analyze their effect on the output,
which allows us to understand the impact that each landmark has on the final
score. In addition, we also measure the impact that removing a landmark has on
the accuracy across the test set. In both cases, the same landmark $k$ is
removed for all images in the test set, thus preventing it from contributing
to the final score.
This allows us to probe the importance of each landmark across the whole test 
set. 
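
The ablation can be sketched as follows, assuming that the final score is the
average of the per-landmark class scores and that the linear classifier has no
bias term, so a zeroed landmark vector contributes zero scores. All names are
illustrative.

```python
def final_scores(landmarks, W_class, removed=None):
    """Average class scores over landmarks, optionally ablating one of them.

    landmarks: K vectors of length H; W_class: H x C.
    """
    C = len(W_class[0])
    totals = [0.0] * C
    for k, l_k in enumerate(landmarks):
        if k == removed:
            l_k = [0.0] * len(l_k)            # set the feature vector to zero
        for c in range(C):
            totals[c] += sum(l_k[h] * W_class[h][c] for h in range(len(l_k)))
    return [t / len(landmarks) for t in totals]


def accuracy_with_ablation(dataset, W_class, removed=None):
    """dataset: list of (landmarks, true_class) pairs."""
    correct = 0
    for landmarks, label in dataset:
        scores = final_scores(landmarks, W_class, removed)
        if max(range(len(scores)), key=scores.__getitem__) == label:
            correct += 1
    return correct / len(dataset)
```

Comparing `accuracy_with_ablation(dataset, W_class)` with the value obtained
for each `removed=k` gives a per-landmark importance at the dataset level.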


Landmark-Based Uncertainty Analysis. Due to occlusions, unreliable fluke 
features or wrongly placed landmarks, different groups of landmarks in the same 
image may provide evidence for conflicting outputs. Similarly, each individual 
landmark detector may receive conflicting signals from the previous layer about 
where to place the landmark on the image. This disagreement can be used to
estimate uncertainty.
In order to measure this disagreement, we perform two experiments applying
different types of Monte Carlo dropout (i.e. test time dropout) to the landmarks.

Class Uncertainty Through Whole Landmark Dropout. We randomly choose half
of the landmarks and use them to obtain a class prediction $y_r$. We perform this
operation $R$ times to obtain a collection of class predictions $\mathcal{R} = \{y_1, \dots, y_R\}$.
The agreement score $a$ is then computed as the proportion of random draws that
output the most frequently predicted class:

$$a = \frac{1}{R} \sum_{r=1}^{R} \left[ y_r = \operatorname{mode}(\mathcal{R}) \right]. \qquad (11)$$
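
The agreement score can be computed directly from the list of predicted classes. The sketch below assumes a hypothetical `classify` callable that takes the features of the kept landmarks and returns a class label; it is an illustration under that assumption, not the authors' code.

```python
import numpy as np
from collections import Counter

def agreement_score(predictions):
    """Fraction of draws that output the most frequently predicted class."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][1] / len(predictions)

def landmark_dropout_agreement(classify, landmarks, runs=100, seed=0):
    """Classify random halves of the landmarks and return the agreement.

    classify: hypothetical callable mapping the kept (K', D) landmark
              features to a predicted class label.
    """
    rng = np.random.default_rng(seed)
    k = landmarks.shape[0]
    preds = []
    for _ in range(runs):
        keep = rng.choice(k, size=k // 2, replace=False)   # random half
        preds.append(int(classify(landmarks[keep])))
    return agreement_score(preds)

example = agreement_score([2, 2, 1, 2])   # three of four draws agree
```

A score of 1 means every random subset led to the same class, while values near 1/R-th of the draws per class indicate strong disagreement between landmark groups.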


Landmark Spatial Uncertainty Through Feature Dropout. In this case we apply 
standard dropout to the feature tensor Z, thus perturbing the landmark atten- 
tion maps A. Landmarks that have not been reliably detected will be more 
sensitive to these perturbations, resulting in higher spatial uncertainty. 
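
One way to quantify this spatial uncertainty is sketched below. The attention head is assumed here to be a 1x1 projection followed by a spatial softmax, and the landmark position is taken as the attention-weighted mean coordinate; both are assumptions of this illustration, as are the names `W_attn`, `attention_maps` and `spatial_uncertainty`.

```python
import numpy as np

def attention_maps(Z, W_attn):
    """Spatial softmax of a 1x1 projection of Z: (C, H, W) -> (K, H, W)."""
    logits = np.einsum("kc,chw->khw", W_attn, Z)
    flat = logits.reshape(logits.shape[0], -1)
    flat = np.exp(flat - flat.max(axis=1, keepdims=True))
    return (flat / flat.sum(axis=1, keepdims=True)).reshape(logits.shape)

def landmark_positions(A):
    """Expected (row, col) position under each attention map; returns (K, 2)."""
    _, h, w = A.shape
    rows, cols = np.mgrid[0:h, 0:w]
    return np.stack([(A * rows).sum(axis=(1, 2)),
                     (A * cols).sum(axis=(1, 2))], axis=1)

def spatial_uncertainty(Z, W_attn, runs=500, p=0.5, seed=0):
    """Per-landmark spread of positions under MC dropout on Z (requires p < 1)."""
    rng = np.random.default_rng(seed)
    positions = []
    for _ in range(runs):
        keep = rng.random(Z.shape) >= p
        Z_drop = Z * keep / (1.0 - p)          # inverted dropout scaling
        positions.append(landmark_positions(attention_maps(Z_drop, W_attn)))
    positions = np.stack(positions)            # (runs, K, 2)
    return np.sqrt(positions.var(axis=0).sum(axis=1))

rng = np.random.default_rng(1)
Z = rng.normal(size=(8, 6, 6))                 # toy feature tensor
W_attn = rng.normal(size=(3, 8))               # toy projection for 3 landmarks
u = spatial_uncertainty(Z, W_attn, runs=50)
```

The returned value is the root of the summed row and column variances, so a landmark whose position barely moves across dropout runs gets a value close to zero.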


5 Experiments and Results 


Our experiments address landmark detection focusing on the uncertainty and 
sensitivity of landmarks, and compare to previous results from patch-based 
occlusion sensitivity maps from [14] by means of whale identification. Further- 
more, the landmarks and occlusion sensitivity maps are compared to the domain 
knowledge of an expert. 

Our method allows us to easily reach conclusions at both the dataset level and
the image level. For one particular image, due to the spatial compactness of 
the landmark attention maps, we can visualize the contribution of each land- 
mark to the final classification score. In addition, the fact that each landmark 
tends to focus on the same fluke features across images allows us to analyze the 
importance of each landmark at the dataset level. 


5.1 Experimental Setup 


We use a modified classification CNN, a ResNet-18 [10], with downsampling reduced
by a factor of four in order to better preserve spatial details. For the final
loss we used the same weight for each of the sub-losses, $\lambda_{\text{triplet}} = \lambda_{\text{conc}} = \lambda_{\text{max}} = \lambda_{\text{class}} = 1$.
We use Adam as the optimizer, with the ResNet-18 model starting
with a learning rate of $10^{-4}$, while $W_{\text{attn}}$ and $W_{\text{class}}$ are optimized starting with
a learning rate of $10^{-2}$. After every epoch, the learning rates are divided by 2
if the validation accuracy decreases. No image pre-processing is used. The top-1
accuracy reaches 86% on the held-out validation set. For comparison, we trained 
the same base model without the attention mechanism, obtaining an accuracy of 
82%, showing that the landmark-based attention mechanism does not penalize 
the model’s performance. 
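
The learning-rate rule described above can be written as a small helper. This sketch assumes that "decreases" means a drop relative to the previous epoch; the group names, starting values and accuracy trace are illustrative and are not taken from the experiments.

```python
def step_learning_rates(lrs, prev_val_acc, val_acc, factor=2.0):
    """Divide every learning rate by `factor` if validation accuracy decreased."""
    if prev_val_acc is not None and val_acc < prev_val_acc:
        return {name: lr / factor for name, lr in lrs.items()}
    return dict(lrs)

# Illustrative values only (assumptions of this sketch).
lrs = {"backbone": 1e-4, "heads": 1e-2}
history = [0.70, 0.75, 0.74, 0.78]      # validation accuracy per epoch
prev = None
for acc in history:
    lrs = step_learning_rates(lrs, prev, acc)
    prev = acc
```

With this trace the accuracy drops once (from 0.75 to 0.74), so each learning rate is halved exactly once.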

For comparison, we use our previously computed occlusion sensitivity maps 
presented in [14], which were based on the data and scores of the classification
framework of the second-place solution¹ of the Kaggle Challenge. For pre-processing,
the framework applies two steps to the raw image. First, the chosen
framework automatically performs image cropping in order to reduce the image
content to the fluke of the whale. The cropped images are resized to a uniform
size of 256 px × 512 px. In the second step, the framework performs standard
normalization on the input images. The architecture is based on ResNet-101 [10]
utilizing triplet loss [35], ArcFace loss [4], and focal loss [17]. With this model, 
we reach a top-5 accuracy of 94.2%. 


5.2 Uncertainty and Sensitivity Analysis of the Landmarks 


[Fig. 3 plots. Left panel: x-axis "Number of landmarks". Right panel: x-axis "Confidence"; legend: Highest class score, Landmark dropout agreement, Ideal calibration (dashed).]


Fig. 3. Left: Average score and standard deviation by randomly selecting an increas- 
ing number of landmarks. Right: Expected accuracy as a function of two different 
confidence scores: the highest class score after softmax, and the agreement between 
100 landmark dropout runs. 


Figure 3 (left) shows the uncertainty of the predicted score, i.e. how much the 
result score varies when a certain number of landmarks is used. It can be seen 
that the uncertainty becomes smaller the more landmarks are used. The reason 
for this is that usually several features are used for identification - by the domain 
expert as well as by the neural network - and with increasing number of land- 
marks the possibility to cover several features increases. Figure 3 (right) displays 
the expected accuracy for varying levels of confidence estimates. We compare
two estimates: the maximum softmax score, in blue, and the agreement between
100 runs of MC landmark dropout with a dropout rate of 0.5, in orange. We can
see that the latter follows more closely the behaviour of an ideally calibrated 
estimate (dashed line). 
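
A calibration curve of this kind can be obtained by binning the confidence values and measuring the accuracy within each bin. The following is a minimal sketch with made-up inputs; the number of bins and the equal-width binning are assumptions of the example.

```python
import numpy as np

def expected_accuracy_by_confidence(confidence, correct, n_bins=5):
    """Mean accuracy of predictions grouped into equal-width confidence bins.

    Returns (bin_centers, accuracy); bins without samples are NaN.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each sample to a bin; a confidence of exactly 1.0 goes to the last bin.
    idx = np.clip(np.digitize(confidence, edges) - 1, 0, n_bins - 1)
    acc = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            acc[b] = correct[mask].mean()
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, acc

conf = [0.1, 0.15, 0.5, 0.55, 0.9, 0.95, 1.0]   # toy confidence estimates
hit = [0, 0, 1, 0, 1, 1, 1]                     # 1 if the prediction was correct
centers, acc = expected_accuracy_by_confidence(conf, hit)
```

For an ideally calibrated estimate, the accuracy in each bin matches the bin's confidence, so the curve lies on the diagonal. Either the softmax score or the landmark dropout agreement can be passed as `confidence`.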


¹ 2nd place: https://github.com/SeuTao/Humpback-Whale-Identification-Challenge-2019_2nd_palce_solution.

Fig. 4. Top: Average sensitivity heatmap rendered on the landmark locations of one
image, representing the average reduction in the score of the correct class after removing
each landmark. Bottom: Average loss in accuracy, in percentage points, after removing
each landmark. Photo CC BY-NC 4.0 John Calambokidis.


5.3 Heatmapping Results and Comparison with Whale Expert 
Knowledge 


Figure 4 shows the mean landmark sensitivity (top), as well as the loss of accu- 
racy after removing landmarks (bottom), calculated over the complete data set. 
When compared to the landmarks near the fluke tips, it can be seen that the 
landmarks near the notch change the score the most, and flip the classification 
towards the correct class the most often. This is consistent with the fact that 
the interior of a fluke changes rather little over time, while the fluke tips can 
change significantly over time. Also, the pose and activity of the whale when 
the images are captured might explain this behavior. It is worth noting that all 
the attention is concentrated along the trailing edge of the fluke. This may be 
due to the fact that it is the area of the fluke that is most reliably visible in the 
images, since the leading edge tends to be under water in a number of photos. 

In the following, we examine the landmark-based and patch-based tools in 
terms of the features considered as important by the whale expert on individual 
images. We show the results on two pairs of images such that each pair belongs 
to the same individual. Figure 5a highlights the main areas the expert focused
on in order to conclude whether they do belong to the same individual or not
after inspecting both images side by side. Note the expert’s tendency to
annotate just a small number of compact regions.

The heatmaps obtained using patch-based occlusion are shown in Fig. 5b. 
Although the fluke itself is recognised as being important to the classification, 
no particular area is highlighted, except for one case where the whole trailing 
edge appears to be important. In addition, some regions outside of the fluke 
seem to have a negative sensitivity, pointing at the possibility of an artifact 
in the dataset that is being used by the model. This was observed in previous 
publications [14], where authors concluded that patch-based occlusion was using 
the shape of the entire fluke, rather than specific, localised patterns. 

The results of the landmark-based approach, in Fig. 5c, show more expert- 
like heatmaps, with the evidence for and against a match always located on the 
fluke and generally around the trailing edge and close to the notch. In each case, 
only a few small regions are responsible for the evidence in favor of assigning 
each pair to the same individual. However, although both the expert and the
(a) Expert annotations (b) Occlusion-based (c) Landmark-based


Fig. 5. Heatmaps of attribution. Dark blue/red areas highlight the regions that are 
estimated to provide evidence for/against the match. The top two pairs are matching 
pairs (same individual) while the bottom one is not a match. (Color figure online) 


landmark-based method tend to point at the same general areas
around the trailing edge with compact highlights, we do not observe a consistent
overlap with the expert-annotated images. This may be due to constraints in both
the expert and the landmark-based highlights. Unlike the expert, the landmark-based
approach tends to focus, by design, on the areas of the fluke that are most
reliably visible. The expert, on the other hand, explores all visible fluke features
and highlights them in a non-exhaustive manner. On the top image pair, a region
that is also annotated by the expert on the left fluke provides most of the positive
evidence, but a feature close to the leading edge is ignored. This is probably due
to the model learning that the leading edge is less reliable, since it is under water
in a large number of photos. On the middle pair, the area to the left of the notch
is assigned a negative sensitivity while being annotated as important by the
Fig. 6. Spatial uncertainty of each landmark on different whales determined by means
of 500 dropout runs on the feature tensor Z. Each disk represents the location of 
a landmark in one run and each of the ten landmarks is colored consistently across 
images. Top: The test images with the lowest uncertainty. Bottom: The test images 
with the highest uncertainty. (Color figure online) 


expert. On the bottom pair we see that only the landmarks closest to the notch 
are used by the model to decide that the images do indeed belong to different 
individuals, while the expert has also annotated a region close to the fluke tip, 
which the landmark-based model systematically ignores, likely due to the fact, 
as with the leading edge, that the tips are less reliably visible in the images. 


5.4 Spatial Uncertainty of Individual Landmarks 


The visualizations in Fig. 6 display the six images in the test set with the lowest
and with the highest uncertainty, each on a different individual. The colored
disks represent the positions of each landmark across 500 random applications
of dropout, with a dropout probability of 0.5, to the feature tensor Z. The colors
are consistent (e.g. landmark 5, as seen in Fig. 4, is always represented in
dark blue). The top rows tend to contain images with clearly visible flukes in 
a canonical pose. As we can see, the detected keypoints do behave as land- 
marks, each specializing in a particular part of the fluke, even if no particu- 
lar element of the loss was designed to explicitly promote this behaviour. The 
bottom rows contain images with either substantial occlusions or uncommon 
poses. This shows how the spatial uncertainty uncovered by MC dropout can be
used to detect unreliably located landmarks, which in turn can be used to find
images with problematic poses and occlusions that are likely to be unsuitable for 
identification. 


6 Conclusion and Outlook 


In this work, we explore the use of landmark detection learning using only class 
labels (i.e. whale identities) and apply it to gain insights into which fluke parts are 
relevant to the model’s decision in the context of cetacean individual identifica- 
tion. Our experiments show that, compared to patch-based occlusion mapping, 
our approach highlights regions in the images that are systematically located 
along the central part of the trailing edge of the fluke, which is the part most 
reliably visible in the images. At the same time, the landmarks highlight com- 
pact regions that are much more expert-like than the baseline OSM heatmaps. 
In addition, we show that the agreement of random subsets of the landmarks is 
a better estimate of the expected error rate than the softmax score. However, 
there seems to be little agreement between the specific regions chosen by the 
expert and the landmark-based highlights. 

The use of landmarks makes it easy to match them across images, since each 
landmark develops a tendency to specialize on a particular region of the fluke. 
This allowed us to study their average importance for the whole validation set, 
leading us to conclude that the areas of the trailing edge right next to the notch 
tend to be the most relied upon. This is probably due to the higher
temporal stability of the region around the notch, which is less exposed and
thus less likely to develop scars, and to the fact that the trailing edge is the
part of the fluke most often visible in the photos. It is also worth noting that the
proposed method is inherently interpretable, thus not only guaranteeing that the
generated heatmaps are relevant to the model’s decision, but also doing so at a
negligible computational cost, requiring only a single inference pass and no
gradient information. In addition, the accuracy obtained is noticeably higher
than a model with the same base architecture but no attention mechanism. 

In spite of these advantages, we also observed an inherent limitation of the 
method when compared to the expert annotations. Our landmark-based model 
must find all landmarks on each image, resulting in a tendency to only
focus on the areas of the fluke that are most reliably visible and discarding those 
that are often occluded, such as the tips and the leading edge. Designing a model 
that is free to detect a varying number of landmarks is a potential path towards 
even more expert-like explanations.

Abstract. In recent years, artificial intelligence and specifically artificial
neural networks (NNs) have shown great success in solving complex,
nonlinear problems in earth sciences. Despite their success, the
strategies upon which NNs make decisions are hard to decipher, which 
prevents scientists from interpreting and building trust in the NN pre- 
dictions; a highly desired and necessary condition for the further use and 
exploitation of NNs’ potential. Thus, a variety of methods have been 
recently introduced with the aim of attributing the NN predictions to 
specific features in the input space and explaining their strategy. The 
so-called eXplainable Artificial Intelligence (XAI) is already seeing great 
application in a plethora of fields, offering promising results and insights 
about the decision strategies of NNs. Here, we provide an overview of 
the most recent work from our group, applying XAI to meteorology and 
climate science. Specifically, we present results from satellite applica- 
tions that include weather phenomena identification and image to image 
translation, applications to climate prediction at subseasonal to decadal 
timescales, and detection of forced climatic changes and anthropogenic 
footprint. We also summarize a recently introduced synthetic benchmark 
dataset that can be used to improve our understanding of different XAI 
methods and introduce objectivity into the assessment of their fidelity. 
With this overview, we aim to illustrate how gaining accurate insights 
about the NN decision strategy can help climate scientists and meteorol- 
ogists improve practices in fine-tuning model architectures, calibrating 
trust in climate and weather prediction and attribution, and learning 
new science.

1 Introduction


In the last decade, artificial neural networks (NNs) [38] have been increasingly 
used for solving a plethora of problems in the earth sciences [5,7,21,27,36,41,
60,62,68,72], including marine science [41], solid earth science [7], and climate
science and meteorology [5,21,62]. The popularity of NNs stems partially from 
their high performance in capturing/predicting nonlinear system behavior [38], 
the increasing availability of observational and simulated data [1,20,56,61], and 
the increase in computational power that allows for processing large amounts of 
data simultaneously. Despite their high predictive skill, NNs are not interpretable 
(usually referred to as “black box” models), which means that the strategy they 
use to make predictions is not inherently known (as, in contrast, is the case for 
e.g., linear models). This may introduce doubt with regard to the reliability of 
NN predictions and it does not allow scientists to apply NNs to problems where 
model interpretability is necessary. 

To address the interpretability issue, many different methods have recently 
been developed [3,4,32,53,69,70,73,75,77,84] in the emerging field of eXplain- 
able Artificial Intelligence (XAI) [9,12,78]. These methods aim at a post hoc 
attribution of the NN prediction to specific features in the input domain (usu- 
ally referred to as attribution/relevance heatmaps), thus identifying relationships 
between the input and the output that may be interpreted physically by the sci- 
entists. XAI methods have already offered promising results and fruitful insights 
into how NNs predict in many applications and in various fields, making “black 
box” models more transparent [50]. In the geosciences, physical understanding 
about how a model predicts is highly desired, so XAI methods are expected to
be a real game-changer for the further application of NNs in this field [79]. 

In this chapter, we provide an overview of the most recent studies from our 
group that implement XAI in the fields of climate science and meteorology. We 
focus here on outlining our own work, with whose details we are most familiar,
but we highlight that relevant work has also been established by other
groups (see e.g., [18,34,52,74]). The first part of this overview presents results
from direct application of XAI to solve various prediction problems that are of
particular interest to the community. We start with XAI applications in remote
sensing, specifically for image-to-image translation of satellite imagery to inform 
weather forecasting. Second, we focus on applications of climate prediction at 
a range of timescales from subseasonal to decadal, and last, we show how XAI 
can be used to detect forced climatic changes and anthropogenic footprint in 
observations and simulations. The second part of this overview explores ways 
that can help scientists gain insights about systematic strengths and weaknesses 
of different XAI methods and generally improve their assessment. So far in the 
literature, there has been no objective framework to assess how accurately an 
XAI method explains the strategy of a NN, since the ground truth of what the 
explanation should look like is typically unknown. Here, we discuss a recently 
introduced synthetic benchmark dataset that can introduce objectivity in assess- 
ing XAI methods’ fidelity for weather/climate applications, which will lead to 
better understanding and implementation.

The overall aim of this chapter is to illustrate how XAI methods can be
used to help scientists fine-tune NN models that perform poorly, build trust in
models that are successful and investigate new physical insights and connections 
between the input and the output (see Fig. 1). 


i) XAI to guide the design of the NN architecture. One of the main challenges 
when using a NN is how to decide on the proper NN architecture for the 
problem at hand. We argue that XAI methods can be an effective tool for 
analysts to get insight into a flawed NN strategy and be able to revise it in 
order to improve prediction performance. 

ii) XAI to help calibrate trust in the NN predictions. Even in cases when a NN
(or any black-box model in general) exhibits a high predictive performance,
it is not guaranteed that the underlying strategy that is used for prediction
is correct. This has famously been depicted in the example of “Clever Hans”,
a horse that appeared to solve mathematical sums and problems correctly but was
in fact reacting to the audience [35]. By using XAI methods, scientists can verify
when a prediction is successful for the right reasons (i.e., they can test against
“Clever Hans” prediction models [35]), thus helping build model trust.

iii) XAI to help learn new science. XAI methods allow scientists to gain 
physical insights about the connections between the input variables and the 
predicted output, and generally about the problem at hand. In cases where 
the highlighted connections are not fully anticipated/understood by already 
established science, further research and investigation may be warranted, 
which can accelerate learning new science. We highlight though that XAI 
methods will most often motivate new analysis to learn and establish new 
science, but cannot prove the existence of a physical phenomenon, link or 
mechanism, since correlation does not imply causation. 


The content of the chapter is mainly based on previously published work 
from our group [6,16,25,29,43,47,80], and is re-organized here to be easily followed
by the non-expert reader. In Sect. 2, we present results from various XAI
applications in climate science and meteorology. In Sect. 3, we outline a new
framework to generate attribution benchmark datasets to objectively evaluate 
XAI methods’ fidelity, and in Sect. 4, we state our conclusions. 


2 XAI Applications 


2.1 XAI in Remote Sensing and Weather Forecasting 


As a first application of XAI, we focus on the field of remote sensing and short- 
term weather forecasting. When it comes to forecasting high impact weather 
hazards, imagery from geostationary satellites has been extensively used as a
tool for situation awareness by human forecasters, since it supports the need for 
high spatial resolution and temporally rapid refreshing [40]. However, information
from geostationary satellite imagery has less frequently been used in data
Fig. 1. XAI offers the opportunity for scientists to gain insights about the decision
strategy of NNs, and help fine-tune and optimize models, gauge trust and investigate
new physical insights to establish new science. 


assimilation or integrated into weather-forecasting numerical models, despite the 
advantages that these data could offer in improving numerical forecasts. 

In recent work [25], scientists have used XAI to estimate precipitation over 
the contiguous United States from satellite imagery. These precipitation scenes,
which are typically produced by radars and come in the form of radar reflectivity,
can then be integrated into numerical models to spin up convection. Thus, the
motivation of this research was to exploit the NNs’ high potential in capturing 
spatial information together with the large quantity, high quality and low latency 
of satellite imagery, in order to inform numerical modeling and forecasting. This 
could be greatly advantageous for mitigation of weather hazards. 

For their analysis, Hilburn et al. (2021) [25] developed a convolutional NN 
with a U-Net architecture (dubbed GREMLIN in the original paper). The inputs 
to the network were four-channel satellite images, each one containing brightness 
temperature and lightning information, over various regions around the US. As 
output, the network was trained to predict a single-channel image (i.e., an image- 
to-image translation application) that represents precipitation over the same 
region as the input, in the form of radar reflectivity and measured in dBZ. 
The network was trained against radar observations, and its overall prediction 
performance across testing samples was quite successful. Specifically, predictions 
from the GREMLIN model exhibited an overall coefficient of determination on 
the order of $R^2 = 0.74$ against the radar observations and a root mean squared
difference on the order of 5.53 dBZ.

Apart from statistically evaluating the performance of GREMLIN predictions
in reproducing reflectivity fields, it was also very important to assess the
strategy upon which the model predicted. For this purpose, Hilburn et al. made 
use of a well-known XAI method, the Layer-wise Relevance Propagation (LRP 
[4]). Given an input sample and an output pixel, LRP reveals which features 
in the input contributed the most in deriving the value of the output. This 
is accomplished by sequentially propagating backwards the relevance from the 
output pixel to the neurons of the previous layers and eventually to the input 
features. So far, numerous different rules have been proposed in the literature 
as to how this propagation of relevance can be performed, and in this XAI 
application the alpha-beta rule was used [4], with $\alpha = 1$ and $\beta = 0$. The
alpha-beta rule distinguishes between strictly positive and strictly negative pre-activations,
which helps avoid the possibility of infinitely growing relevances in
the propagation phase and provides more stable results.
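
For a single fully connected layer, the alpha-beta rule can be sketched as follows. This is a generic textbook-style illustration, not the code used for GREMLIN: bias terms are ignored, the small `eps` stabilizer is an assumption of the sketch, and the function name is hypothetical.

```python
import numpy as np

def lrp_alpha_beta_dense(a, W, R_out, alpha=1.0, beta=0.0, eps=1e-12):
    """Redistribute the relevance R_out (shape (J,)) of a dense layer to its inputs.

    a: (I,) input activations; W: (I, J) weights. Bias terms are ignored.
    """
    z = a[:, None] * W                        # contributions z_ij = a_i * w_ij
    z_pos = np.clip(z, 0.0, None)             # positive contributions
    z_neg = np.clip(z, None, 0.0)             # negative contributions
    pos = z_pos / (z_pos.sum(axis=0) + eps)   # share of the positive part
    neg = z_neg / (z_neg.sum(axis=0) - eps)   # share of the negative part
    return ((alpha * pos - beta * neg) * R_out[None, :]).sum(axis=1)

a = np.array([1.0, 2.0, 0.5])
W = np.array([[1.0, -1.0],
              [0.5, 2.0],
              [-2.0, 1.0]])
R_out = np.array([1.0, 2.0])
R_in = lrp_alpha_beta_dense(a, W, R_out)      # alpha = 1, beta = 0
```

With alpha = 1 and beta = 0, only positive contributions receive relevance, so the input relevances are non-negative and their sum matches the output relevance whenever each output has at least one positive contribution.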

In Fig. 2, we show LRP results for GREMLIN for a specific sample, and a
specific output pixel (namely, the central location of the shown sample), chosen 
for its close proximity to strong lightning activity. The first row of the figure 
shows the input channels and the corresponding desired output (i.e., the radar 
observation). The second row shows the LRP maps, highlighting which features 
in the input channels the neural network paid attention to in order to estimate 
the value of the chosen central output pixel for this sample. 

The LRP results for the channel with lightning information show that the 
network focused only on regions where lightning was present in that channel. The 
LRP results for the other channels show that even in those channels the NN’s 
attention was drawn to focus on regions where lightning was present. Hilburn 
et al. then performed a new experiment by modifying the input sample to have 
all lightning removed, that is, all the lightning values were set to zero. In this 
case, LRP highlighted that the network’s focus shifted entirely to the first three
input channels, as expected. More specifically, the focus shifted to two types of
locations, namely, (i) cloud boundaries, or (ii) areas where the input channels
had high brightness (cold temperatures), as can be seen by comparing the three
leftmost panels of the first and third rows. In fact, near the center of the third-row
panels, it can be seen that the LRP patterns represent the union of the
cloud boundaries and the locations of strongest brightness in the first row. LRP
vanishes further away from the center location, as is expected considering the
nature of the effective receptive field that corresponds to the output pixel. 

The LRP results as presented above provide very valuable insight about how 
the network derives its estimates. Specifically, the results indicate the following 
strategy used by GREMLIN: whenever lightning is present near the output pixel, 
the NN primarily focuses on the values of input pixels where lightning is present, 
not only in the channel that contains the lightning information, but in all four 
input channels. It seems that the network has learned that locations containing 
lightning are good indicators of high reflectivity, even in the other input channels. 
When no lightning is present, the NN focuses primarily on cloud boundaries 
(locations where the gradient is strong) or locations of very cold cloud tops. The
Fig. 2. LRP results for the GREMLIN model. (top) The four input channels and
the corresponding observed radar image (ground truth). (middle) LRP results for the
original four input channels and the chosen output pixel, and the prediction from
GREMLIN. (bottom) The equivalent of the middle row, but after all lightning values
were set to zero. Note that all images are zoomed into a region centered at the pixel 
of interest. Adapted from Hilburn et al., 2021 [25]. 


network seems to have learned that these locations have the highest predictive 
power for estimating reflectivity. 

In this application of XAI in remote sensing, the obtained insights from LRP 
have given scientists the confidence that the network derived predictions based 
on a physically reasonable strategy and thus helped build more trust in
its predictions. Moreover, if scientists wish to improve the model further by
testing different model architectures, knowing how physically consistent
the different decision strategies of the models are offers a criterion to distinguish
between models, which goes beyond prediction performance. 


2.2 XAI in Climate Prediction 


Similar to weather forecasting, climate prediction at subseasonal, seasonal and 
decadal timescales is among the most important challenges in climate science, 
with great societal risks and implications for the economy, water security, and 
ecosystem management for many regions around the world [8]. Typically, climate
prediction draws upon sea surface temperature (SST) information (especially
on seasonal timescales and beyond), which is considered the principal
forcing variable of the atmospheric circulation that ultimately drives regional
climate [19,23,30,44,57]. SST information is used for prediction either through
deterministic models (i.e., SST-forced climate model simulations) or statistical
models which aim to exploit physically and historically established teleconnections
of regional climate with large-scale modes of climate variability (e.g., the
El Niño-Southern Oscillation, ENSO; [11,17,44,45,48,49,54,59,67]). Limits to
predictive skill of dynamical models arise from incomplete knowledge of initial 
conditions, uncertainties and biases in model physics, and limits on computa- 
tional resources that place constraints on the grid resolution used in operational 
systems. Similarly, empirical statistical models exhibit limited predictive skill, 
arising primarily from the complex and non-stationary nature of the relationship 
between large scale modes and regional climate. 

To address the latter, in more recent years, data-driven machine learning 
methods that leverage information from the entire globe (i.e., beyond prede- 
fined climate indices) have been suggested in the literature and they have shown 
improvements in predictive skill [13,76]. A number of studies have specifically 
shown the potential of neural networks in predicting climate across a range of 
scales, capitalizing on their ability to capture nonlinear dependencies (see e.g., 
[21]), while more recent studies have used XAI methods¹ to explain these networks
and their strategies to increase trust and learn new science [47,79,80].

In the first study outlined herein, Mayer and Barnes (2021) [47] used XAI 
in an innovative way to show that NNs can identify when favorable conditions 
that lead to enhanced predictive skill of regional climate are present in the atmosphere
(the so-called “forecasts of opportunity”) or not. More specifically, the
authors based their analysis on the known climate teleconnections between the 
Madden-Julian Oscillation in the tropics (MJO; an eastward moving disturbance 
of convection in the tropical atmosphere) and the North Atlantic atmospheric 
pressure [10,24]. When the MJO is active, it leads to a consistent and coher- 
ent modulation of the midlatitude climate on subseasonal timescales, and thus, 
corresponds to enhanced predictive potential for the midlatitudes. The ques- 
tion that Mayer and Barnes put forward was whether or not NNs can capture 
this inherent property of the climate system of exhibiting periods of enhanced 
predictability (i.e. forecasts of opportunity). 

The authors used daily data of outgoing longwave radiation (OLR; a measure 
of convective activity) over the tropics of the globe and trained a fully connected 
NN to predict the sign of the 500 hPa geopotential height anomalies (a measure 
of atmospheric pressure) over the North Atlantic, 22 days later. Their results 
showed that when the network was assigning higher confidence to a prediction
(i.e., the likelihood of either the positive or the negative geopotential height class
was much higher than the opposite class), it was much more likely for that
prediction to end up being correct. On the contrary, when the network was assigning
low confidence to a prediction (i.e., the likelihoods of the positive or negative
geopotential height classes were very similar), the predictive performance of the
network was much poorer, almost identical to a random guess. This meant that
the NN was able to correctly capture the presence of forecasts of opportunity in
the climate system.

¹ We note here that a newly introduced line of research in XAI that is potentially
relevant for climate prediction applications is the concept of causability. Although
XAI is typically used to address transparency of AI, causability refers to the quality
of an explanation and to the extent to which an explanation may allow the scientist
to reach a specified level of causal understanding about the underlying dynamics of
the climate system [26].
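The link between network confidence and predictive skill described above can be illustrated with a minimal sketch. The data below are synthetic stand-ins for the network output (they are not the data of Mayer and Barnes), and the function name `accuracy_of_most_confident` is hypothetical. Confidence is taken here to be the likelihood of the winning class, which is one simple choice among several.

```python
import numpy as np

def accuracy_of_most_confident(probs, labels, top_fraction):
    """Accuracy over the `top_fraction` most confident predictions.

    probs  : (n_samples, 2) class likelihoods from the network (rows sum to 1)
    labels : (n_samples,) true class of each sample (0 = negative, 1 = positive)
    """
    predicted = probs.argmax(axis=1)
    confidence = probs.max(axis=1)                  # likelihood of the winning class
    n_keep = max(1, int(round(top_fraction * len(labels))))
    keep = np.argsort(confidence)[::-1][:n_keep]    # most confident samples first
    return float(np.mean(predicted[keep] == labels[keep]))

# Synthetic stand-in for the network output (an assumption, not study data):
# the true outcome is drawn from the likelihood that the "network" reports.
rng = np.random.default_rng(0)
n = 20_000
p_positive = 1.0 / (1.0 + np.exp(-rng.normal(0.0, 1.5, size=n)))
labels = (rng.random(n) < p_positive).astype(int)
probs = np.column_stack([1.0 - p_positive, p_positive])

acc_all = accuracy_of_most_confident(probs, labels, 1.0)
acc_top10 = accuracy_of_most_confident(probs, labels, 0.10)
print(f"all predictions: {acc_all:.2f}, 10% most confident: {acc_top10:.2f}")
```

In this idealized setting the most confident subset is more accurate than the full set, which is the behavior that identifies forecasts of opportunity.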


Fig. 3. Maps of LRP composites corresponding to the 10% most confident and correct
predictions of positive and negative geopotential height anomalies (panels: Positive
Sign Predictions, Negative Sign Predictions; color scale: relevance). Contours indicate
the corresponding composite fields of the outgoing longwave radiation with solid lines
representing positive values and dashed lines negative values. Adapted from Mayer and
Barnes, 2021 [47].


Mayer and Barnes went on to explore which features over the tropics
made the network highly confident during forecasts of opportunity, by using the
LRP method. Figure 3 shows the LRP heatmaps for positive and negative,
correctly predicted, anomalies of geopotential height over the North Atlantic. Note
that only the top 10% of the most confident correct predictions were used for the
LRP analysis (these predictions ought to represent cases of forecasts of
opportunity). As shown, LRP identified several sources of predictability over the
southern Indian Ocean, the Maritime Continent and the western Pacific Ocean 
for positive predictions, and over the Maritime Continent, the western and cen- 
tral Pacific and over the western side of Hawaii for negative predictions. Judging 
by the OLR contours, the highlighted patterns correspond to dipoles of con- 
vection over the Indian Ocean and into the Maritime Continent in the first case 
and over the Maritime Continent and into the western Pacific in the second case. 
These patterns are consistent with the MJO structure and correspond to specific 
phases of the phenomenon, which in turn have been shown to be connected with 
the climate of the North Atlantic [10,24]. Thus, the implementation of LRP in 
this problem confirms that the network correctly captured the MJO-modulated 
forecasts of opportunity on subseasonal scales, and it further builds trust for the 
network’s predictive performance. 
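The compositing step described above can be sketched as follows. This is an illustration under stated assumptions: the heatmaps are random placeholders rather than LRP output, the function name `composite_confident_correct` is hypothetical, and the top 10% is taken among the correct predictions of the requested class (the text does not specify whether the percentile is computed before or after filtering for correctness).

```python
import numpy as np

def composite_confident_correct(heatmaps, probs, labels, target_class, top_fraction=0.10):
    """Average the heatmaps of the most confident, correct predictions of one class.

    heatmaps : (n_samples, n_lat, n_lon) relevance maps, one per prediction
    probs    : (n_samples, 2) class likelihoods
    labels   : (n_samples,) true classes
    """
    predicted = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    candidates = np.flatnonzero((predicted == labels) & (predicted == target_class))
    if candidates.size == 0:
        raise ValueError("no correct predictions of the requested class")
    n_keep = max(1, int(np.ceil(top_fraction * candidates.size)))
    order = np.argsort(confidence[candidates])[::-1]   # most confident first
    selected = candidates[order[:n_keep]]
    return heatmaps[selected].mean(axis=0), selected

# Placeholder inputs (assumptions for illustration only).
rng = np.random.default_rng(0)
n, n_lat, n_lon = 500, 4, 6
p = rng.random(n)
probs = np.column_stack([1.0 - p, p])
labels = rng.integers(0, 2, size=n)
heatmaps = rng.normal(size=(n, n_lat, n_lon))

composite, selected = composite_confident_correct(heatmaps, probs, labels, target_class=1)
print(composite.shape, len(selected))
```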

In a second climate prediction application, this time on decadal scales, Toms 
et al. (2021) [80] used simulated data from fully-coupled climate models and 
explored sources of decadal predictability in the climate system. Specifically,
Toms et al. used global SST information as the predictor, with the aim of
predicting continental surface temperature around the globe; for each grid point
over land, a separate dense network was used. In this way, by combining the 
large number of samples provided by the climate models (unrealistically large 
sample size compared to what is available in the observational record) and the 
ability of NNs to capture nonlinear dynamics, the authors were able to assess 
the predictability of the climate system in a nonlinear setting. Note that
assessing predictability using observational records has typically been based on linear
models of limited complexity to avoid overfitting, given the small sample sizes
that are usually available [13,76]. Since the climate system is far from linear,
the investigation by Toms et al. may be argued to provide a better estimate 
of predictability than previous work. The results showed that there are several 
regions where surface temperature is practically unpredictable, whereas there 
are also regions of high predictability, namely, “hotspots” of predictability, i.e., 
regions where the predictive skill is inherently high. The presence of hotspots of
predictability is conceptually the same as the presence of forecasts of
opportunity on subseasonal scales that was discussed in the previous application.


Fig. 4. Composite of LRP maps for the sea surface temperature (SST) field for accurate
predictions of positive surface temperature anomalies at four locations across North
America. The continental locations associated with the composites are denoted by the
red dots in each panel. The LRP map for each sample is normalized between a value
of 0 and 1 before compositing to ensure each prediction carries the same weight in
the composite. The number of samples used in each composite is shown within each
sub-figure. Color scale: relevance (unitless). Adapted from Toms et al., 2021 [80].


Toms et al. explored the sources of predictability of surface temperature
over North America by using the LRP method. Figure 4 shows the composite
LRP maps that correspond to correctly predicted positive temperature anomalies
over four different regions in North America. One can observe that different
SST patterns are highlighted as sources of predictability for each of the four
regions. Perhaps surprisingly, temperature anomalies over Central America are
shown to be most associated with SST anomalies off the east coast of Japan
(Fig. 4a), likely related to the Kuroshio Extension [58]. SST anomalies over
the North-Central Pacific Ocean are associated with continental temperature 
anomalies along the west coast (Fig. 4b), while those within the tropical Pacific 
Ocean contribute to predictability across central North America (Fig. 4c). Lastly, 
the North Atlantic SSTs contribute predictability to all four regions, although 
their impacts are more prominent across the northeastern side of the continent 
(Fig. 4d). The highlighted patterns of predictability as assessed by LRP
resemble known modes of SST variability, such as the El Niño-Southern Oscillation
(e.g., [55,81]), the Pacific Decadal Oscillation [46,54], and the Atlantic
Multidecadal Oscillation [17]. These modes are known to affect hydroclimate over
North America [11,17,48,49,54]; thus, this application constitutes one more case
where XAI methods can help scientists build model trust. More importantly, in
this setting, physical insights can be extracted about sources of temperature 
predictability over the entire globe, by sequentially applying LRP to each of 
the trained networks. As Toms et al. highlight, such an analysis could motivate 
further mechanistic investigation to physically establish new climate
teleconnections. Thus, this application also illustrates how XAI methods can help advance
climate science.
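The caption of Fig. 4 states that each LRP map is normalized between 0 and 1 before compositing, so that every prediction carries the same weight. A minimal sketch of that step is given below. Min-max scaling is assumed as the normalization (the caption does not spell out the formula), and the function name `normalized_composite` and the input maps are hypothetical.

```python
import numpy as np

def normalized_composite(heatmaps):
    """Scale each map to [0, 1] (min-max, an assumed choice), then average.

    heatmaps : (n_samples, n_lat, n_lon) relevance maps
    """
    flat = heatmaps.reshape(len(heatmaps), -1)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)     # a constant map becomes all zeros
    scaled = (flat - lo) / span
    return scaled.mean(axis=0).reshape(heatmaps.shape[1:])

# Placeholder maps; the first sample has a much larger magnitude than the rest.
rng = np.random.default_rng(0)
maps = rng.random(size=(5, 3, 4))
maps[0] *= 1000.0
composite = normalized_composite(maps)
print(composite.min(), composite.max())
```

Without the per-sample scaling, the first map would dominate the average; with it, rescaling any single map by a positive constant leaves the composite unchanged.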


2.3 XAI to Extract Forced Climate Change Signals 
and Anthropogenic Footprint 


As a final application of XAI to meteorology and climate science, we consider 
studies that try to identify human-caused climatic changes (i.e. climate change 
signals) and anthropogenic footprint in observations or simulations. Detect- 
ing climate change signals has been recognized in the climate community as 
a signal-to-noise problem, where the warming “signal” arising from the slow 
(long timescales), human-caused changes in the atmospheric concentrations of 
greenhouse gases is superimposed on the background “noise” of natural climate 
variability [66]. By solely using observations, one cannot identify which climatic 
changes are happening due to anthropogenic forcing, since there is no way to 
strictly infer the possible contribution of natural variability to these observed 
changes. Hence, the state-of-the-art approach to quantify or to account for natu- 
ral variability within the climate community is the utilization of large ensembles 
of climate model simulations (e.g., [14,28]). Specifically, researchers simulate 
multiple trajectories of the climate system, which start from slightly different 
initial states but share a common forcing (natural forcing or not). Under this 
setting, natural variability is represented by the range of the simulated future 
climates given a specific forcing, and the signal of the forced changes in the 
climate can be estimated by averaging across all simulations [51]. 
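The large-ensemble reasoning above can be illustrated with a toy calculation. All numbers below (trend, noise level, ensemble size) are hypothetical and chosen only for illustration; the sketch assumes that every member shares the same forced signal and differs only by independent noise standing in for natural variability.

```python
import numpy as np

rng = np.random.default_rng(0)
n_members, n_years = 40, 100                  # hypothetical ensemble size and length
years = np.arange(n_years)

forced = 0.02 * years                         # hypothetical forced warming signal
noise = rng.normal(0.0, 0.3, size=(n_members, n_years))   # member-specific variability
ensemble = forced + noise                     # each row is one simulated trajectory

forced_estimate = ensemble.mean(axis=0)       # averaging across members isolates the signal
natural_spread = ensemble.std(axis=0, ddof=1) # spread across members measures the noise

rmse_mean = np.sqrt(np.mean((forced_estimate - forced) ** 2))
rmse_single = np.sqrt(np.mean((ensemble[0] - forced) ** 2))
print(f"RMSE of ensemble mean: {rmse_mean:.3f}, of a single member: {rmse_single:.3f}")
```

The ensemble mean tracks the forced signal far more closely than any single trajectory, which is why a single observed record cannot by itself separate the two contributions.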

Utilizing these state-of-the-art climate change simulations, Barnes et al. 
(2020) [6] used XAI in an innovative way to detect forced climatic changes
in temperature and precipitation. Specifically, the authors trained a fully
connected NN to predict the year that corresponded to a given (as an input) map
of annual-mean temperature (or precipitation) that had been simulated by a
climate model. For the NN to be able to predict the year of each map correctly, it
needs to learn to look for and distinguish specific features of forced climatic change
amidst the background natural variability and model differences. In other words,
only robust (present in all models) and pronounced (not overwhelmed by natural
variability) climate change signals arising from anthropogenic forcing would
allow the NN to distinguish between a year in the early decades versus late
decades of the simulation. Climate change signals that are weak compared to
the background natural variability or exhibit high uncertainty across different 
climate models will not be helpful to the NN. 

In the way Barnes et al. have formed the prediction task, the prediction itself 
is of limited or no utility (i.e., there is no utility in predicting the year that a 
model-produced temperature map corresponds to; it is already known). Rather, 
the goal of the analysis is to explore which features help the NN distinguish each 
year and gain physical insight about robust signals of human-caused climate 
change. This means that the goal of the analysis lies in the explanation of the
network and not the prediction. Barnes et al. trained the NN over the entire 
simulation period 1920-2099, using 80% of the climate model simulations and 
then tested on the remaining 20%. Climate simulations were carried out by 29 
different models, since the authors were interested in extracting climate change 
signals that are robust across multiple climate models. Results showed that the 
NN was able to predict quite successfully the correct years that different temper- 
ature and precipitation maps corresponded to. Yet, the performance was lower 
for years before the 1960s and much higher for years well into the 21st century. 
This is due to the fact that the climate change signal becomes more pronounced 
with time, which makes it easier to distinguish amidst the background noise and 
the model uncertainty. 
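Because the training and testing sets are defined in terms of whole simulations rather than individual maps, the split can be sketched as below. This is an assumption-laden illustration: it takes one simulation per model (29 in total), 180 annual maps per simulation (1920-2099), and a random assignment of simulations to the two sets; the function name `split_by_simulation` is hypothetical and the original study may have selected the simulations differently.

```python
import numpy as np

def split_by_simulation(sim_ids, train_fraction=0.8, seed=0):
    """Return boolean masks so that all maps of a simulation fall in the same set."""
    rng = np.random.default_rng(seed)
    simulations = rng.permutation(np.unique(sim_ids))
    n_train = int(round(train_fraction * len(simulations)))
    train_mask = np.isin(sim_ids, simulations[:n_train])
    return train_mask, ~train_mask

n_sims, n_years = 29, 180                       # assumed: one simulation per model
sim_ids = np.repeat(np.arange(n_sims), n_years) # simulation index of every annual map
years = np.tile(np.arange(1920, 2100), n_sims)  # regression target of every annual map

train_mask, test_mask = split_by_simulation(sim_ids)
print(train_mask.sum(), test_mask.sum())
```

Splitting by simulation keeps maps from the same trajectory out of both sets at once, so the test score reflects skill on simulations the network has never seen.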

Next, Barnes et al. used LRP to gain insight into the forced climatic changes 
in the simulations that had helped the NN to correctly predict each year. Figure 5 
shows the LRP results for the years 1975, 2035 and 2095. It can be seen that 
different areas are highlighted during different years, which indicates that the 
relative importance of different climate change signals varies through time. For 
example, LRP highlights the North Atlantic temperature to be a strong indicator 
of climate change during the late 20th and early 21st century, but not during the 
late 21st century. On the contrary, the Southern Ocean gains importance only 
throughout the 21st century. Similarly, the temperature over eastern China is 
highlighted only in the late 20th century, which likely reflects the aerosol forcing 
which acts to decrease temperature. Thus, the NN learned that strong cooling
over China relative to the overall warming of the world is an indicator for the
corresponding temperature map to belong to the late 20th century.


Fig. 5. LRP heatmaps for temperature input maps composited for a range of years
when the prediction was deemed accurate. The years are shown above each panel along
with the number of maps used in the composites. Darker shading denotes regions that
are more relevant for the NN’s accurate prediction. Adapted from Barnes et al., 2020 [6].


The above results (see original study by Barnes et al. for more information) 
highlight the importance and utility of explaining the NN decisions in this pre- 
diction task and the physical insights that XAI methods can offer. As we men- 
tioned, in this analysis the explanation of the network was the goal, while the 
predictions themselves were not important. Generally, this application demon- 
strates that XAI methods constitute a powerful approach for extracting climate 
patterns of forced change amidst any background noise, and advancing climate 
change understanding. 

A second application where XAI was used to extract the anthropogenic
footprint was published by Keys et al. (2021) [29]. In that study, the authors aimed
at constructing a NN to predict the global human footprint index (HFI) solely
from satellite imagery. The HFI is a dimensionless metric that captures the
extent to which humans have influenced the terrestrial surface of the Earth over
a specific region (see e.g., [82,83]). Typically, the HFI is obtained by harmonizing
eight different sub-indices, each one representing different aspects of human
influence, like built infrastructure, population density, land use, land cover etc.
So far, the process for establishing the HFI involves significant data analysis 
and modelling that does not allow for fast updates and continuous monitoring 
of the index, which means that large-scale, human-caused changes to the land 
surface may occur well before we are able to track them. Thus, estimating the 
HFI solely from satellite imagery, which offers high spatial resolution and rapid
temporal refresh, can help improve monitoring of the human pressure on the
Earth's surface.

Keys et al. trained a convolutional NN to use single images of the land surface 
(Landsat; [22]) over a region to predict the corresponding Williams HFI [83]. The 
authors trained different networks corresponding to different areas around the 
world in the year 2000, and then used these trained networks to evaluate Landsat 
images from the year 2019. Results showed that the NNs were able to reproduce 
the HFI with high fidelity. Moreover, by comparing the estimated HFI in 2000 
with the one in 2019, the authors were able to gain insight into the changes 
in the human pressure on the Earth's surface during the last 20 years. Patterns
of change were consistent with a steady expansion of the human pressure into 
areas of previously low HFI or increase of density of pressure in regions with 
previously high HFI values. 

Subsequently, Keys et al. applied the LRP method for cases where the HFI
increased significantly between the years 2000 and 2019. In this way, the authors 
aimed to gain confidence that the NN was focusing on the correct features in 
the satellite images to predict increases of the human footprint. As an example, 
in Fig. 6, we present the LRP results for a region over Texas, where wind farms 
were installed between the years 2000 and 2019; compare the satellite images in 
the left and middle panels of the figure. As shown in the LRP results, the NN 
correctly paid attention to the installed wind farm features in order to predict an 
increase of the HFI in the year 2019. By examining many other cases of increase 
in HFI, the authors reported that in most instances, the NN was found to place 
the highest attention to features that were clearly due to human activity, which 
provided them with confidence that the network performed with high accuracy 
for the right reasons. 


3 Development of Attribution Benchmarks 
for Geosciences 


As was illustrated in the previous sections, XAI methods have already shown 
their potential and been used in various climate and weather applications to 
provide valuable insights about NN decision strategies. However, many of these 
methods have been shown in the computer science literature to not honor 
desirable properties (e.g., “completeness” or “implementation invariance”; see 
[77]), and in general, to face nontrivial limitations for specific problem setups 
[2,15,31,63]. Moreover, given that many different methods have been proposed
in the field of XAI (see e.g., [3,4,32,53,69,70,73,75,77,84] among others) with
each one explaining the network in a different way, it is key to better understand
differences between methods, both their relative strengths and weaknesses, so
that researchers are aware which methods are more suitable to use depending
on the model architecture and the objective of the explanation. Thus, thorough
investigation and objective assessment of XAI methods are of vital importance.


Fig. 6. Wind farm installation (Texas, USA). Satellite imagery from the Global Forest
Change dataset over Texas, USA, in (left) 2000 and (middle) 2019. (right) The most
relevant features to the NN for its year-2019 prediction of the HFI, as estimated
using LRP. Adapted from Keys et al., 2021 [29].

So far, the assessment of different XAI methods has been mainly based on 
applying these methods to benchmark problems, where the scientist is expected 
to know what the attribution heatmaps should look like, hence, being able to 
judge the performance of the XAI method in question. Examples of benchmark 
problems in climate science include the classification of El Niño or La Niña
years or seasonal prediction of regional hydroclimate [21,79]. In computer
science, commonly used benchmark datasets for image classification problems are,
among others, the MNIST or ImageNet datasets [39,64]. Although the use of
such benchmark datasets helps the scientist gain some general insight about the
XAI method’s efficiency, this is always based on the scientist’s subjective visual 
inspection of the result and their prior knowledge and understanding of the prob- 
lem at hand, which has high risk of cherry-picking specific samples/methods and 
reinforcing individual biases [37]. In classification tasks, for example, just because 
it might make sense to a human that an XAI method highlights the ears or the 
nose of a cat for an image successfully classified as “cat”, this does not necessarily 
mean that this is the strategy the model in question is actually using, since there 
is no objective truth about the relative importance of these two or other features 
to the prediction. The actual importance of different features to the network’s 
prediction is always case- or dataset-dependent, and the human perception of an 
explanation alone is not a solid criterion for assessing its trustworthiness. 

With the aim of more falsifiable XAI research [37], Mamalakis et al. (2021)
[43] put forward the concept of attribution benchmark datasets. These are syn- 
thetic datasets (consisting of synthetic inputs and outputs) that have been 
designed and generated in a way so that the importance of each input feature to 
the prediction is objectively derivable and known a priori. This a priori known
attribution can be used as ground truth for evaluating different XAI methods
and identifying systematic strengths and weaknesses. The authors referred to
such synthetic datasets as attribution benchmark datasets, to distinguish from 
benchmarks where no ground truth of the attribution/explanation is available. 
The framework was proposed for regression problems (but can be extended into 
classification problems too), where the input is a 2D field (i.e., a single-channel
image), as commonly found in geoscientific applications (e.g., [13,21,76,79]). Below
we briefly summarize the proposed framework and the attribution benchmark 
dataset that Mamalakis et al. used, and we present comparisons between differ- 
ent XAI methods that provide insights about their performance. 


3.1 Synthetic Framework 


Mamalakis et al. considered a climate prediction setting (i.e., prediction of
regional climate from global 2D fields of SST; see e.g., [13,76]), and generated
N realizations of an input random vector $\mathbf{X} \in \mathbb{R}^d$ from a multivariate Normal
Distribution (see step 1 in Fig. 7); these are N synthetic inputs representing
vectorized 2D SST fields. Next, the authors used a nonlinear function $F: \mathbb{R}^d \to \mathbb{R}$,
which represented the physical system, to map each realization $\mathbf{x}_n$ into a scalar
$y_n$, and generated the output random variable $Y$ (see step 2 in Fig. 7); these
synthetic outputs represented the series of the predictand climatic variable.
Subsequently, the authors trained a fully-connected NN to approximate function $F$
and compared the model attributions estimated by different XAI methods with
the ground truth of the attribution. The general idea of this framework is summa- 
rized in Fig. 7, and although the dataset was inspired from a climate prediction 
setting, the concept of attribution benchmarks is generic and applicable to a 
large number of problem settings in the geosciences and beyond. 

Regarding the form of function F that is used to generate the variable Y
from X, Mamalakis et al. noted that it can be chosen arbitrarily, as long
as it has such a form so that the importance/contribution of each of the input
variables to the response Y is objectively derivable. The simplest form for F so
that the above property is honored is when F is an additively separable function,
i.e., there exist local functions $C_i$, with $i = 1, 2, \ldots, d$, so that:


$$F(\mathbf{X}) = F(X_1, X_2, \ldots, X_d) = C_1(X_1) + C_2(X_2) + \ldots + C_d(X_d) \tag{1}$$


where $X_i$ is the random variable at grid point $i$, and the local functions $C_i$ are
nonlinear; if the local functions $C_i$ are linear, Eq. 1 falls back to a trivial linear
problem, which is not particularly interesting to benchmark a NN or an XAI
method against. Mamalakis et al. defined the local functions to be piece-wise
linear functions, with number of break points K = 5. The break points and
the slopes between the break points were chosen randomly for each grid point
(see the original paper for more information). Importantly, with F being an
additively separable function as in Eq. 1, the relevance/contribution of each of
the variables $X_i$ to the response $y_n$ for any sample n is by definition equal to
the value of the corresponding local function, i.e., $R^{\mathrm{true}}_{i,n} = C_i(x_{i,n})$; that is when
considering a zero baseline. This satisfies the basic desired property for F that
any response can be objectively attributed to the input.


Fig. 7. Schematic overview of the framework to generate synthetic attribution
benchmarks. In step 1, N independent realizations of a random vector $\mathbf{X} \in \mathbb{R}^d$ are generated
from a multivariate Normal Distribution. In step 2, a response $Y \in \mathbb{R}$ to the synthetic
input X is generated using a known nonlinear function F. In step 3, a fully-connected
NN is trained using the synthetic data X and Y to approximate the function F. The
NN learns a function $\hat{F}$. Lastly, in step 4, the attribution heatmaps estimated from
different XAI methods are compared to the ground truth (that represents the
function F), which has been objectively derived for any sample n = 1, 2, ..., N. Similar to
Mamalakis et al., 2021 [43].
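Steps 1 and 2 of the framework in Fig. 7 can be sketched on a small scale as follows. This is a minimal illustration, not the authors' code: the dimension, sample size, covariance matrix, value range of the break points, and the distribution of the slopes are all assumptions, and the helper `make_local_function` is hypothetical. Each local function is shifted so that it is zero at the origin, which matches the zero-baseline convention mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, K = 20, 1000, 5          # grid points, samples (both assumed), break points (K = 5)

# Step 1: N inputs from a multivariate normal distribution (covariance is an assumption).
A = rng.normal(size=(d, d))
cov = A @ A.T / d + 0.1 * np.eye(d)
X = rng.multivariate_normal(np.zeros(d), cov, size=N)

def make_local_function(rng, K, lo=-4.0, hi=4.0):
    """Random continuous piece-wise linear function with K break points and C(0) = 0."""
    knots = np.concatenate(([lo], np.sort(rng.uniform(lo, hi, size=K)), [hi]))
    slopes = rng.normal(size=K + 1)                       # one slope per segment
    values = np.concatenate(([0.0], np.cumsum(slopes * np.diff(knots))))
    offset = np.interp(0.0, knots, values)                # shift so that C(0) = 0
    # Outside [lo, hi], np.interp holds the end values constant.
    return lambda x: np.interp(x, knots, values) - offset

local_functions = [make_local_function(rng, K) for _ in range(d)]

# Ground truth attribution: the contribution of grid point i to sample n is C_i(x_{i,n}).
R_true = np.column_stack([C(X[:, i]) for i, C in enumerate(local_functions)])

# Step 2: the response is the sum of the local contributions (Eq. 1).
Y = R_true.sum(axis=1)
print(X.shape, R_true.shape, Y.shape)
```

Because the response is built as a sum of per-grid-point terms, the array `R_true` is available as an exact reference against which any attribution heatmap can be compared.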


Mamalakis et al. generated $N = 10^6$ samples of input and output and trained
a fully connected NN to learn the function F (see step 3 in Fig. 7), using the 
first 900,000 samples for training and the last 100,000 samples for testing. Apart 
from assessing the prediction performance, the testing samples were also used 
to assess the performance of different post hoc, local XAI methods. The sample 
size was on purpose chosen to be large compared to typical samples in climate 
prediction applications. In this way, the authors aimed to ensure that they could 
achieve an almost perfect training and establish a fair assessment of XAI meth- 
ods; they wanted to ensure that any discrepancy between the ground truth of 
the attribution and the results of XAI methods came from systematic pitfalls in 
the XAI method and to a lesser degree from poor training of the NN. Indeed, 
the authors achieved a very high prediction performance, with the coefficient 
of determination of the NN prediction in the testing data being slightly higher 
than $R^2 = 99\%$, which suggests that the NN could capture 99% of the variance
in Y.


3.2 Assessment of XAI Methods


For their assessment, Mamalakis et al. considered different post hoc, local
XAI methods that have been commonly used in the literature. Specifically,
the methods that were assessed included Gradient [71], Smooth Gradient [73],
Input*Gradient [70], Integrated Gradients [77], Deep Taylor [53] and LRP
[4]. In Fig. 8, we present the ground truth and the estimated relevance heatmaps
from the XAI methods for one example sample (each heatmap is standardized by the
corresponding maximum absolute relevance within the map). This sample corresponds
to a response $y_n = 0.0283$, while the NN predicted 0.0301. Based on the ground truth,
features that contributed positively to the response $y_n$ occur mainly over the northern,
eastern tropical and southern Pacific Ocean, the northern Atlantic Ocean, and
the Indian Ocean. Features with negative contribution occur over the tropical
Atlantic Ocean and the southern Indian Ocean.

The results from the method Gradient are not consistent at all with the 
ground truth. In the eastern tropical and southern Pacific Ocean, the method 
returns negative values instead of positive, and over the tropical Atlantic, positive 
values (instead of negative) are highlighted. The pattern (Spearman’s)
correlation is very small, on the order of 0.13, consistent with the above observations.
As theoretically expected, this result indicates that the sensitivity of the out- 
put to the input is not the same as the attribution of the output to the input 
[3]. The method Smooth Gradient performs poorly and similarly to the method 
Gradient, with a correlation coefficient on the order of 0.16. 
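The distinction between sensitivity and attribution can be made concrete on a known additively separable function. The sketch below applies Gradient, Input*Gradient and Integrated Gradients (zero baseline) directly to a synthetic function F rather than to a trained NN, so its numbers are not comparable to those reported above; the local functions, dimensions and helper names are assumptions made for illustration. For a separable F, Integrated Gradients with a zero baseline equals $C_i(x_i) - C_i(0)$, which is the ground truth when $C_i(0) = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, K = 100, 200, 5                    # assumed sizes; K = 5 break points
LO, HI = -4.0, 4.0

# Hypothetical piece-wise linear local functions C_i, shifted so that C_i(0) = 0.
knots = np.empty((d, K + 2))
values = np.empty((d, K + 2))
for i in range(d):
    knots[i] = np.concatenate(([LO], np.sort(rng.uniform(LO, HI, K)), [HI]))
    slopes = rng.normal(size=K + 1)
    v = np.concatenate(([0.0], np.cumsum(slopes * np.diff(knots[i]))))
    values[i] = v - np.interp(0.0, knots[i], v)

def C(X):
    """Local contributions C_i(x_i) for inputs X of shape (n, d): the ground truth."""
    return np.column_stack([np.interp(X[:, i], knots[i], values[i]) for i in range(d)])

def gradient(X, h=1e-4):
    """dF/dx_i by central differences; F is separable, so this equals dC_i/dx_i."""
    return (C(X + h) - C(X - h)) / (2.0 * h)

def integrated_gradients(X, steps=50):
    """Midpoint Riemann sum of the gradient along the straight path from 0 to X."""
    alphas = (np.arange(steps) + 0.5) / steps
    return X * np.mean([gradient(a * X) for a in alphas], axis=0)

def spearman(a, b):
    """Rank correlation; assumes no ties, which holds for continuous data."""
    rank = lambda v: np.argsort(np.argsort(v))
    return np.corrcoef(rank(a), rank(b))[0, 1]

X = rng.normal(size=(N, d))
truth = C(X)
methods = {
    "Gradient": gradient(X),
    "Input*Gradient": X * gradient(X),
    "Integrated Gradients": integrated_gradients(X),
}
mean_corr = {
    name: float(np.mean([spearman(attr[n], truth[n]) for n in range(N)]))
    for name, attr in methods.items()
}
for name, r in mean_corr.items():
    print(f"{name:22s} mean Spearman = {r:.2f}")
```

In this setting the local slope carries no information about the sign of the input, so the Gradient heatmap is expected to be only weakly related to the contribution, whereas multiplying by the input or integrating along the path recovers it to a much larger degree.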


Fig. 8. Performance of different XAI methods (panels include the ground truth, Gradient,
Smooth Gradient and Integrated Gradients; the sample shown has $y_n = 0.0283$ and an
NN prediction of 0.0301). The XAI performance is assessed by
comparing the estimated heatmaps to the ground truth. All heatmaps are
standardized with the corresponding maximum (absolute) value. Red (blue) color corresponds
to positive (negative) contribution to the response/prediction, with darker shading
representing higher (absolute) values. The Spearman’s rank correlation coefficient between
each heatmap and the ground truth is also provided. Only for the methods Deep Taylor
and LRP$_{\alpha=1,\beta=0}$, the correlation with the absolute ground truth is given. Similar to
Mamalakis et al., 2021 [43]. (Color figure online)


Methods Input*Gradient and Integrated Gradients perform very similarly,
both capturing the ground truth very closely. Indeed, both methods capture the
positive patterns over the eastern Pacific, northern Atlantic and Indian Oceans,
and to an extent the negative patterns over the tropical Atlantic and southern
Indian Oceans. The Spearman’s correlation with the ground truth for both
methods is on the order of 0.75, indicating very high agreement.

Regarding the LRP method, first, results confirm the arguments in [53,65],
that Deep Taylor leads to results similar to those of LRP$_{\alpha=1,\beta=0}$, when a NN
with ReLU activations is used. Second, both methods return only positive
contributions. This was explained by Mamalakis et al. and is due to the fact that
the propagation rule of LRP$_{\alpha=1,\beta=0}$ is performed based on the product of the
relevance in the higher layer with a strictly positive number. Hence, the sign
of the NN prediction is propagated back to all neurons and to all features of
the input. Because the NN prediction is positive in Fig. 8, it is expected
that LRP$_{\alpha=1,\beta=0}$ (and Deep Taylor) returns only positive contributions (see also
remarks by [33]). What is not so intuitive is the fact that LRP$_{\alpha=1,\beta=0}$ seems
to highlight all important features, independent of the sign of their contribution
(compare with ground truth). Given that, by construction, LRP$_{\alpha=1,\beta=0}$
considers only positive preactivations [23], one might assume that it will only highlight
the features that positively contribute to the prediction. However, the results in
Fig. 8 show that the method highlights the tropical Atlantic Ocean with a
positive contribution. This is problematic, since the ground truth clearly indicates
that this region is contributing negatively to the response $y_n$ in this example.
The issue of LRP$_{\alpha=1,\beta=0}$ highlighting all features independent of whether
they are contributing positively or negatively to the prediction has been very
recently discussed in other applications of XAI as well [33].
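The sign propagation described above can be reproduced with a small, self-contained sketch. It implements the $\alpha=1,\beta=0$ rule for a bias-free, fully connected ReLU network with a linear output, using random weights; the network, its size and the function names are assumptions made for illustration and this is not the implementation used in the study. At every layer the relevance of the upper layer is multiplied by non-negative shares, so every input relevance inherits the sign of the output.

```python
import numpy as np

def forward(x, weights):
    """Activations of every layer of a bias-free ReLU network with a linear output."""
    activations = [x]
    for k, W in enumerate(weights):
        z = activations[-1] @ W
        is_output = k == len(weights) - 1
        activations.append(z if is_output else np.maximum(z, 0.0))
    return activations

def lrp_alpha1_beta0(x, weights, eps=1e-12):
    """Propagate the output back to the input, keeping positive contributions only."""
    activations = forward(x, weights)
    relevance = activations[-1]                      # start from the network output
    for W, a in zip(weights[::-1], activations[-2::-1]):
        z_pos = np.maximum(a[:, None] * W, 0.0)      # positive part of a_i * w_ij
        share = z_pos / (z_pos.sum(axis=0) + eps)    # non-negative share per upper neuron
        relevance = share @ relevance
    return relevance

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 16)), rng.normal(size=(16, 1))]
x = rng.normal(size=8)
flipped = weights[:-1] + [-weights[-1]]              # same network with the output negated

for net in (weights, flipped):
    output = forward(x, net)[-1][0]
    relevance = lrp_alpha1_beta0(x, net)
    print(f"output {output:+.3f} -> relevance in [{relevance.min():+.3f}, {relevance.max():+.3f}]")
```

Negating the output layer flips the sign of the prediction, and with it the sign of every input relevance, regardless of whether a feature actually pushed the prediction up or down.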

Lastly, when using the LRP$_z$ rule, the attribution heatmap very closely
captures the ground truth, and it exhibits a very high Spearman’s correlation on the
order of 0.76. The results are very similar to those of the methods Input*Gradient
and Integrated Gradients, making these three methods the best performing ones
for this example. This is consistent with the discussion in [2], which showed
the equivalence of the methods Input*Gradient and LRP$_z$ in cases of NNs with
ReLU activation functions, as in this work.
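The equivalence between Input*Gradient and the z-rule of LRP can be checked numerically on a toy network. The sketch below assumes a single hidden ReLU layer, no bias terms and random weights, all of which are simplifications chosen for illustration; `lrp_z_layer` is a hypothetical helper, and the small stabilizer `eps` only guards against division by zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 6, 10                                  # assumed input and hidden sizes
W1 = rng.normal(size=(d, h))
w2 = rng.normal(size=h)
x = rng.normal(size=d)

# Forward pass of a bias-free network: f(x) = w2 . relu(x W1).
z1 = x @ W1
a1 = np.maximum(z1, 0.0)
f = a1 @ w2

# Input*Gradient: the gradient only flows through active hidden units.
grad = W1 @ (w2 * (z1 > 0))
input_times_grad = x * grad

def lrp_z_layer(a, W, R_upper, eps=1e-12):
    """Redistribute R_upper to the lower layer in proportion to z_ij = a_i * w_ij."""
    z = a[:, None] * W
    total = z.sum(axis=0)
    total = total + eps * np.where(total >= 0, 1.0, -1.0)   # avoid division by zero
    return (z / total) @ R_upper

R_hidden = lrp_z_layer(a1, w2[:, None], np.array([f]))      # output -> hidden
R_input = lrp_z_layer(x, W1, R_hidden)                      # hidden -> input

print(np.max(np.abs(R_input - input_times_grad)))
```

Both heatmaps also sum to the network output in this bias-free setting, which is the completeness property mentioned earlier.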

To verify that the above insights are valid for the entire testing dataset and
not only for the specific example in Fig. 8, we also generated the histograms of
the Spearman’s correlation coefficients between the XAI methods and the ground
truth for all 100,000 testing samples (similarly to Mamalakis et al. [43]). As
shown in Fig. 9, methods Gradient and Smooth Gradient perform very poorly (both
exhibit almost zero average correlation with the ground truth), while methods
Input*Gradient and Integrated Gradients perform equally well, exhibiting an
average correlation with the ground truth around 0.7. The LRP_z rule is seen to
be the best performing among the LRP rules, with very similar performance to
the Input*Gradient and Integrated Gradients methods (as theoretically expected
for this model setting; see [2]). The corresponding average correlation coefficient
is also on the order of 0.7. Regarding the LRP_{α=1,β=0} rule, we present two
curves. The first curve (black curve in Fig. 9) corresponds to correlation with
the ground truth after we have set all the negative contributions in the ground
truth to zero. The second curve (blue curve) corresponds to correlation with the
absolute value of the ground truth. For both curves we multiply the correlation
value by -1 when the NN prediction was negative, to account for the fact that
the prediction’s sign is propagated back to the attributions. Results show that
when correlating with the absolute ground truth (blue curve), the correlations
are systematically higher than when correlating with the nonnegative ground
truth (black curve). This verifies that the issue of LRP_{α=1,β=0} highlighting both
positive and negative attributions occurs for all testing samples.

Fig. 9. Summary of the performance of different XAI methods. Histograms of the
Spearman’s correlation coefficients between different XAI heatmaps and the ground
truth for 100,000 testing samples. Similar to Mamalakis et al., 2021 [43].
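
The per-sample comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to produce Fig. 9: the function names `rankdata`, `spearman` and `signed_scores` are hypothetical, tied values receive their average rank, and the heatmap and ground-truth vectors are invented.

```python
from math import sqrt

def rankdata(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks

def spearman(a, b):
    """Spearman's correlation, computed as the Pearson correlation of the ranks."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / sqrt(var_a * var_b)

def signed_scores(heatmap, truth, prediction):
    """Return the two scores behind the black and blue curves for one sample.

    First value: correlation with the ground truth after negative
    contributions are set to zero. Second value: correlation with the
    absolute ground truth. Both are multiplied by -1 for a negative prediction.
    """
    sign = -1.0 if prediction < 0 else 1.0
    nonnegative = [max(t, 0.0) for t in truth]
    absolute = [abs(t) for t in truth]
    return sign * spearman(heatmap, nonnegative), sign * spearman(heatmap, absolute)

# Invented sample: the heatmap follows the magnitude of the ground truth
# and ignores its sign.
truth = [0.5, -2.0, 1.0, -0.1, 3.0]
heatmap = [0.4, 1.9, 1.1, 0.2, 2.5]
r_nonneg, r_abs = signed_scores(heatmap, truth, prediction=2.4)
# r_abs is 1.0 here, while r_nonneg is about 0.56.
```

In this toy sample the correlation with the absolute ground truth exceeds the correlation with the nonnegative ground truth, which is the pattern reported for the blue and black curves.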

In general, these results demonstrate the benefits of attribution benchmarks
for the identification of possible systematic pitfalls of XAI. The above
assessment suggests that methods Gradient and Smooth Gradient may be suitable
for estimating the sensitivity of the output to the input, but this is not
necessarily equivalent to the attribution. When using the LRP_{α=1,β=0} rule,
one should be cautious, always keeping in mind that (i) it might propagate the
sign of the prediction back to all the relevancies of the input layer and (ii)
it is likely to mix positive and negative contributions. For the setup used
here (i.e., to address the specific prediction task using a shallow, fully
connected network), the methods Input*Gradient, Integrated Gradients, and the
LRP_z rule all very closely captured the true function F and are the best
performing XAI methods considered. However, this result does not mean that the
latter methods are systematically better performing for all types of
applications. For example, in a different prediction setting (i.e., for a
different function F) and when using a deep convolutional neural network, the
above methods have been found to provide relatively incomprehensible
explanations due to gradient shattering [42]. Thus, no optimal method exists
in general and each method’s suitability depends on the type of the
application and the adopted model architecture, which highlights the need to
objectively assess XAI methods for a range of applications and develop
best-practice guidelines.


4 Conclusions 


The potential of NNs to successfully tackle complex problems in earth sciences 
has become quite evident in recent years. An important requirement for further 
application and exploitation of NNs in geoscience is their interpretability, and 
newly developed XAI methods show very promising results for this task. In this 
chapter we provided an overview of the most recent work from our group, apply- 
ing XAI to meteorology and climate science. This overview clearly illustrates 
that XAI methods can provide valuable insights on the NN strategies, and that 
they are used in these fields under many different settings and prediction tasks, 
being beneficial for different scientific goals. For many applications that have 
been published in the literature, the ultimate goal is a high-performing
prediction model, and XAI methods are used by the scientists to calibrate their
trust in the model, by ensuring that the decision strategy of the network is physically
consistent (see e.g., [18,21,25,29,35,47,80]). In this way scientists can ensure 
that a high prediction performance is due to the right reasons, and that the 
network has learnt the true dynamics of the problem. Moreover, in many pre- 
diction applications, the explanation is used to help guide the design of the 
network that will be used to tackle the prediction problem (see e.g., [16]). As we 
showed, there are also applications where the prediction is not the goal of the 
analysis, but rather, the scientists are interested solely in the explanation. In 
this category of studies, XAI methods are used to gain physical insights about 
the dynamics of the problem or the sources of predictability. The highlighted 
relationships between the input and the output may warrant further investigation
and advance our understanding, hence establishing new science (see e.g.,
[6,74,79,80]).

Independent of the goal of the analysis, an important aspect in XAI research 
is to better understand and assess the many different XAI methods that exist, 
in order to more successfully implement them. This need for objectivity in the 
XAI assessment arises from the fact that XAI methods are typically assessed 
without the use of any ground truth to test against and the conclusions can often 
be subjective. Thus, here we also summarized a newly introduced framework to 
generate synthetic attribution benchmarks to objectively test XAI methods [43]. 
In the proposed framework, the ground truth of the attribution of the output to 
the input is derivable for any sample and known a priori. This allows the scientist 
to objectively assess if the explanation is accurate or not. The framework is 
based on the use of additively separable functions, where the response Y ∈ ℝ to
the input X ∈ ℝ^d is the sum of local responses. The local responses may have
any functional form, and independent of how complex that might be, the true
attribution is always derivable. We believe that a common use and engagement 
of such attribution benchmarks by the geoscientific community can lead to a 
more cautious and accurate application of XAI methods to physical problems, 
towards increasing model trust and facilitating scientific discovery. 
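
As a minimal illustration of why the attribution is derivable in this setting, consider a toy additively separable function in which the local response of each feature is read as that feature's contribution to the output. The three local functions below are arbitrary choices made for this sketch (they are not the functions used in [43]), and the names `LOCAL_FUNCTIONS`, `true_attribution` and `response` are hypothetical.

```python
import math

# Hypothetical local responses C_i, one per input feature.
# Any functional form would do; these are chosen only for illustration.
LOCAL_FUNCTIONS = [
    lambda x: 2.0 * x,        # linear
    lambda x: x ** 2 - 1.0,   # quadratic
    lambda x: math.sin(x),    # periodic
]

def true_attribution(x):
    """Ground-truth attribution: the local response of each input feature."""
    return [c(xi) for c, xi in zip(LOCAL_FUNCTIONS, x)]

def response(x):
    """Y = F(X), the sum of the local responses."""
    return sum(true_attribution(x))

sample = [1.0, 2.0, 0.0]
attribution = true_attribution(sample)  # [2.0, 3.0, 0.0]
y = response(sample)                    # 5.0
```

Because the response is by construction the sum of the entries of `attribution`, an XAI heatmap for a network trained on such data can be compared against this vector sample by sample.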


Abstract. The quest to explain the output of artificial intelligence systems
has clearly moved from a merely technical to a highly legally and
politically relevant endeavor. In this paper, we provide an overview of 
legal obligations to explain AI and evaluate current policy proposals. In 
this, we distinguish between different functional varieties of AI expla- 
nations - such as multiple forms of enabling, technical and protective 
transparency - and show how different legal areas engage with and man- 
date such different types of explanations to varying degrees. Starting 
with the rights-enabling framework of the GDPR, we proceed to uncover 
technical and protective forms of explanations owed under contract, tort 
and banking law. Moreover, we discuss what the recent EU proposal 
for an Artificial Intelligence Act means for explainable AI, and review 
the proposal’s strengths and limitations in this respect. Finally, from a 
policy perspective, we advocate for moving beyond mere explainability 
towards a more encompassing framework for trustworthy and responsible 
AI that includes actionable explanations, values-in-design and co-design 
methodologies, interactions with algorithmic fairness, and quality bench- 
marking. 


Keywords: Artificial intelligence · Explainability · Regulation


1 Introduction 


Sunlight is the best disinfectant, as the saying goes. Therefore, it does not come 
as a surprise that transparency constitutes a key societal desideratum vis-à-vis
complex, modern IT systems in general [67] and artificial intelligence (AI)
in particular [18,74]. As in the case of very similar demands concerning other
forms of opaque or, at least from an outsider perspective, inscrutable decision 
making processes of bureaucratic systems, transparency is seen as a means of 
making decisions more understandable, more contestable, or at least more ratio- 
nal. More specifically, explainability of AI systems generally denotes the degree 
to which an observer may understand the causes of the system’s output [15,64]. 
Various technical implementations of explainability have been suggested, from 
truth maintenance systems for causal reasoning in the case of symbolic reasoning
systems that were developed mainly from the 1970s to the 1990s to
layerwise relevance propagation methods for neural networks today. Importantly, 
observers, and with them the adequate explanations for a specific context, may 
vary [3, p. 85]. 

In recent years, the quest for transparent and explainable AI has not only 
spurred a vast array of research efforts in machine learning [3,82, and the chap- 
ters in this volume for an overview], but it has also emerged at the heart of many 
ethics and responsible design proposals [43,45,66,68] and has nurtured a vivid 
debate on the promises and limitations of advanced machine learning models for 
various high-stakes scenarios [12,37,88]. 


1.1 Functional Varieties of AI Explanations 


Importantly, from a normative perspective, different arguments can be advanced 
to justify the need for transparency in AI systems [3]. For example, given its rela- 
tion to human autonomy and dignity, one may advance a ‘deontological’ conception
viewing transparency as an aim in itself [17,92,104]. Moreover, research
suggests that explanations may satisfy the curiosity of counterparties, their desire 
for learning or control, or fulfill basic communicative standards of dialogue and 
exchange [59,62,64]. From a legal perspective, however, it is submitted that three 
major functional justifications for demands of AI explainability may be distin- 
guished: enabling, technical, and protective varieties. All of them subscribe to 
an ‘instrumentalist’ approach conceiving of transparency as a means to achieve 
technically or normatively desirable ends. 

First, explainability of AI is seen as a prerequisite for empowering those 
affected by its decisions or charged with reviewing them (‘enabling trans- 
parency’). On the one hand, explanations are deemed crucial to afford due pro- 
cess to the affected individuals [23] and to enable them to effectively exercise their 
subjective rights vis-à-vis the (operators of the) AI system [89] (‘rights-enabling
transparency’). Similarly, other parties such as NGOs, collective redress organi- 
zations or supervisory authorities may use explanations to initiate legal reviews, 
e.g. by inspecting AI systems for unlawful behavior such as manipulation or 
discrimination [37, p. 55] (‘review-enabling transparency’). On the other hand,
information about the functioning of AI systems may facilitate informed choice 
of the affected persons about whether and how to engage with the models or 
the offers they accompany and condition. Such ‘decision-enabling transparency’ 
seeks to support effective market choice, for example by switching contracting 
partners [14, p. 156]. 

Second, with respect to technical functionality, explainability may help fine-tune
the performance (e.g., accuracy) of the system in real-world scenarios and
evaluate its generalizability to unseen data [3,47,57,79]. In this vein, it also 
acts as a catalyst for informed decision making, though not of the affected 
persons, but rather of the technical operator or an expert auditor of the sys- 
tem. That approach may hence be termed ‘technical transparency’, its explana- 
tions being geared toward a technically sophisticated audience. Beyond model 
improvements, a key aim here is to generate operational and institutional trust 
in the AI system [37, p. 54], both in the organization operating the AI system 
and beyond in the case of third-party reviews and audits. 

Third, technical improvements translate into legal relevance to the extent 
that they contribute to reducing normatively significant risks. Hence, technically 
superior performance may lead to improved safety (e.g., AI in robots; medical 
AI), reduced misallocation of resources (e.g., planning and logistics tools), or 
better control of systemic risks (e.g., financial risk modelling). This third variety 
could be dubbed ‘protective transparency’, as it seeks to harness explanations 
to guard against legally relevant risks. 

These different types of legally relevant, functional varieties of AI explana- 
tions are not mutually exclusive. For example, technical explanations may, to 
the extent available, also be used by collective redress organizations or supervi- 
sory authorities in a review-enabling way. Nonetheless, the distinctions arguably 
provide helpful analytical starting points. As we shall see, legal provisions com- 
pelling transparency are responsive to these different strands of justification to 
varying degrees. It should not be overlooked, however, that an excess of sunlight 
can be detrimental as well, as skeptics note: explainability requirements may 
not only impose significant and sometimes perhaps prohibitive burdens on the 
use of some of the most powerful AI systems, but also offer affected persons the 
option to strategically “game the system” and accrue undeserved advantages 
[9]. This puts differentiated forms of accountability front and center: to whom - 
users, affected persons, professional audit experts, legitimized rights protection 
organizations, public authorities - should an AI system be transparent? Such 
limitations need to be considered by the regulatory framework as well. 


1.2 Technical Varieties of AI Explanations 


From a technical perspective, in turn, it seems uncontroversial that statements 
about AI and explainability, as well as the potential trade-off with accuracy, 
must be made in a context- and model-specific way [57,81][3, p. 100]. While 
some types of ML models, such as linear or logistic regressions or small decision 
trees [22,47,57], lend themselves rather naturally to global explanations about 
the feature weights for the entire model (often called ex ante interpretability), 
such globally valid statements are much harder to obtain for other model types, 
particularly random forests or deep neural networks [57,79,90]. In recent years, 
such complex model types have been the subject of intense technical research 
to provide for, at the minimum, local explanations of specific decisions ex post, 
often by way of sensitivity analysis [31,60,79]. One specific variety of local
explanations seeks to provide counterfactuals, i.e., suggestions for minimal changes
of the input data to achieve a more desired output [64,97]. Counterfactuals are 
a variety of contrastive explanations, which seek to convey reasons for the con- 
crete output (‘fact’) in relation to another, possible output (‘foil’) and which 
have recently gained significant momentum [65,77]. Other methods have sought 
to combine large numbers of local explanations to approximate a global explana- 
tory model of the AI system by way of overall feature relevance [16,55], while 
other scholars have sought to fiercely defend the benefits of designing models 
that are interpretable ex ante rather than explainable ex post [81]. 
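
To make the notion of a counterfactual concrete, the following sketch searches for the smallest change to a single feature that lifts a linear score to a decision threshold. It is only an illustration under strong assumptions: the model is linear, only one feature may change, "minimal" is measured in raw feature units (which depends on how the features are scaled), and all names and numbers (`score`, `single_feature_counterfactual`, `WEIGHTS`, `BIAS`, `THRESHOLD`) are hypothetical.

```python
def score(weights, bias, x):
    """Linear score: bias plus the weighted sum of the feature values."""
    return bias + sum(w * v for w, v in zip(weights, x))

def single_feature_counterfactual(weights, bias, x, threshold):
    """Smallest change to one feature that lifts the score to the threshold.

    Returns (feature index, required change), or None if the score already
    reaches the threshold.
    """
    gap = threshold - score(weights, bias, x)
    if gap <= 0:
        return None
    # For a linear score, changing feature j by gap / w_j closes the gap.
    candidates = [(abs(gap / w), j, gap / w) for j, w in enumerate(weights) if w != 0]
    _, index, change = min(candidates)
    return index, change

# Invented example with two features.
WEIGHTS = [0.5, 2.0]
BIAS = -10.0
THRESHOLD = 0.0
applicant = [12.0, 1.0]          # score = -10 + 6 + 2 = -2

result = single_feature_counterfactual(WEIGHTS, BIAS, applicant, THRESHOLD)
# result == (1, 1.0): raising the second feature by 1.0 reaches the threshold.
```

For the nonlinear models discussed above, no such closed form exists and counterfactuals have to be found by search or optimization.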


1.3 Roadmap of the Paper 


Arguably, much of this research has been driven, at least implicitly, by the 
assumption that explainable AI systems would be ethically desirable and per- 
haps even legally required [47]. Hence, this paper seeks to provide an overview of 
explainability obligations flowing from the law proper, while engaging with the 
functional and technical distinctions just introduced. The contemporary legal 
debate has its roots in an interpretive battle over specific norms of the GDPR 
[89,96], but has recently expanded beyond the precincts of data protection law to 
other legal fields, such as contract and tort law [42,84]. As this paper will show, 
another important yet often overlooked area which might engender incentives to 
provide explanations for AI models is banking law [54]. Finally, the question of 
transparency has recently been taken up very prominently by the regulatory pro- 
posals at the EU level, particularly in the Commission proposal for an Artificial 
Intelligence Act (AIA). It should be noted that controversies and consultations 
about how to meaningfully regulate AI systems are still ongoing processes and 
that the questions of what kind of explainability obligations follow already from 
existing regulations and which obligations should - in the future - become part 
of AI policy are still very much in flux. This raises the question of the extent to
which these diverging provisions and calls for explainability properly take into 
account the usability of that information for the recipients, in other words: the 
actionability of explainable AI (XXAI), which is also at the core of this volume. 

Against this background, the paper will use the running example of credit 
scoring to investigate whether positive law mandates, or at least sets incentives 
for, the provision of actionable explanations in the use of AI tools, particularly 
in settings involving private actors (Sect. 2); to what extent the proposals for AI 
regulation at the EU level will change these findings (Sect. 3); and how regulation 
and practice could go beyond such provisions to ensure actionable explanations 
and trustworthy AI (Sect. 4). In all of these sections, the findings will be linked to 
the different (instrumentalist) functions of transparency, which are taken up to 
varying degrees by the different provisions and proposals. Figure 1 below provides 
a quick overview of the relations between functions and several existing legal 
acts surveyed in this paper; Fig. 2 (in Sect. 3) connects these functions to the
provisions of the planned AIA.

Fig. 1. Overview of the functions of different EU law instruments concerning AI
explanations; abbreviations: GDPR: General Data Protection Regulation; CRR: Capital
Requirements Regulation; PLD: Product Liability Directive


2 Explainable AI Under Current Law 


The quest for explainable AI interacts with existing law in a number of ways. 
The scope of this paper will be EU law, and for the greatest part the law govern- 
ing exchange between private parties more particularly (for public law, see, e.g. 
[14, 2.2]). Most importantly, and bridging the public-private divide, the GDPR
contains certain rules, however limited and vague, which might be understood as
an obligation to provide explanations of the functioning of AI models (Sect. 2.1).
Beyond data protection law, however, contract and tort law (Sect. 2.2) and bank- 
ing law (Sect. 2.3) also provide significant incentives for the use of explainable 
AI (XAI). 


2.1 The GDPR: Rights-Enabling Transparency 


In the GDPR, whether a subjective right to an explanation of AI decisions 
exists or not has been the object of a long-standing scholarly debate which, until 
this day, has not been finally settled [36,61,89,96]. To appreciate the different 
perspectives, let us consider the example of AI-based credit scoring. Increasingly,
startups use alternative data sets and machine learning to compute credit scores, 
which in turn form the basis of lending decisions (see, e.g., [34,54]). If a particular 
person receives a specific credit score, the question arises if, under the GDPR, the 
candidate may claim access to the feature values used to make the prediction, to 
the weights of the specific features in his or her case (local explanation), or even 
to the weights of the features in the model more generally (global explanation). 
For example, the person might want to know what concrete age and income 
values were used to predict the score, to what extent age or income contributed 
to the prediction in the specific case, and how the model generally weights these 
features. 
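
The three levels of information distinguished here can be illustrated with a toy linear scoring model. The sketch below is purely hypothetical: the weights, the base score and the names `GLOBAL_WEIGHTS`, `BASE_SCORE`, `local_contributions` and `credit_score` are invented and do not describe any real scoring system.

```python
# Hypothetical linear credit-scoring model; all numbers are invented.
GLOBAL_WEIGHTS = {"age": 0.5, "income_k": 2.0}   # global explanation: model-wide weights
BASE_SCORE = 300.0

def local_contributions(features):
    """Local explanation: contribution of each feature to this applicant's score."""
    return {name: GLOBAL_WEIGHTS[name] * value for name, value in features.items()}

def credit_score(features):
    """Score = base score plus the sum of the local contributions."""
    return BASE_SCORE + sum(local_contributions(features).values())

applicant = {"age": 40, "income_k": 50}           # feature values used for the prediction
contributions = local_contributions(applicant)    # {"age": 20.0, "income_k": 100.0}
total = credit_score(applicant)                   # 420.0
```

In a linear model the local contribution is simply the global weight multiplied by the feature value. For nonlinear models this no longer holds, and local contributions have to be estimated with attribution methods.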

So far, there is no guidance by the Court of Justice of the European Union
(CJEU) on precisely this question. However, exactly this case was decided by 
the German Federal Court for Private Law (BGH) in 2014 (BGH, Case VI 
ZR 156/13 = MMR, 2014, 489). The ruling came down not under the GDPR, 
but its predecessor (the 1995 Data Protection Directive) and relevant German 
data protection law. In substance, however, the BGH noted that the individual 
information interest of the plaintiff needed to be balanced against the legitimate 
interests of the German credit scoring agency (Schufa) to keep its trade secrets, 
such as the precise score formula for credit scoring, hidden from the view of the 
public, lest competitors free ride on its know-how. In weighing these opposing 
interests, the BGH concluded that the plaintiff did have a right to access its 
personal data processed for obtaining the credit score (the feature values), but 
not to obtain information on the score formula itself, comparison groups, or 
abstract methods of calculation. Hence, the plaintiff was barred from receiving 
either a local or a global explanation of its credit score. 


2.1.1 Safeguards for Automated Decision Making 

How would such a case be decided under the GDPR, particularly if an AI- 
based scoring system was used? There are two main normative anchors in the 
GDPR that could be used to obtain an explanation of the score, and hence more 
generally of the output of an AI system. First, Article 22 GDPR regulates the 
use of automated decision making in individual cases. That provision, however, 
is subject to several significant limitations. Not only does its wording suggest 
that it applies only to purely automated decisions, taken independently of even 
negligible human interventions (a limitation that could potentially be overcome 
by a more expansive interpretation of the provision, see [96]); more importantly, 
the safeguards it installs in Article 22(3) GDPR for cases of automated decision 
making list ‘the right to obtain human intervention on the part of the controller, 
to express his or her point of view and to contest the decision’ - but not the right 
to an explanation. Rather, such a right is only mentioned in Recital 71 GDPR, 
which provides additional interpretive guidance for Article 22(3) GDPR. Since, 
however, only the Articles of the regulation, not the recitals, constitute binding 
law, many scholars are rightly skeptical whether the CJEU would deduce a right 
to an explanation (of whatever kind) directly from Article 22(3) GDPR [84,96]. 


2.1.2 Meaningful Information About the Logic Involved 

A second, much more promising route is offered by different provisions oblig- 
ing the data controller (i.e., the operator of the AI system) to provide the data 
subject not only with information on the personal data processed (the feature 
values), but also, at least in cases of automated decision making, with ‘mean- 
ingful information about the logic involved’ (Art. 13(2)(f), Art. 14(2)(g), Art. 
15(1)(h) GDPR). 


A Rights-Enabling Conception of Meaningful Information. Since the publication 
of the GDPR, scholars have intensely debated what these provisions mean for 
AI systems (see, e.g., for overviews [20,49]). For instance, in our running example,
we may more concretely ask whether a duty to disclose local or global weights 
of specific features exists in the case of credit scoring. Some scholars stress the 
reference to the concept of ‘logic’, which to them suggests that only the general 
architecture of the system must be divulged, but not more specific information 
on features and weights [73, para. 31c][103]. A more convincing interpretation, 
in our view, would take the purpose of the mentioned provisions into account. 
Hence, from a teleological perspective, the right to meaningful information needs 
to be read in conjunction with the individual rights the GDPR confers in Art. 
16 et seqq. [89]. Such a rights-enabling instrumentalist approach implies that 
information will only be meaningful, to the data subject, if it facilitates the 
exercise of these rights, for example the right to erasure, correction, restriction 
of processing or, perhaps most importantly, the contestation of the decision pur- 
suant to Article 22(3) GDPR. An overarching view of the disclosure provisions 
forcing meaningful information and the safeguards in Article 22(3) GDPR there- 
fore suggests that, already under current data protection law, the information 
provided must be actionable to fulfill its enabling function. Importantly, this 
directly relates to the quest of XXAI research seeking to provide explanations 
that enable recipients to meaningfully reflect upon and intervene in AI-powered
decision-making systems. 

Hence, in our view, more concrete explanations may have to be provided if 
information about the individual features and corresponding weights is necessary
to formulate substantive challenges to the algorithmic scores under the
GDPR’s correction, erasure or contestation rights. Nevertheless, as Article 15(4) 
GDPR and more generally Article 16 of the Charter of Fundamental Rights of 
the EU (freedom to conduct a business) suggest, the information interests of
the data subject must still be balanced against the secrecy interests of the con- 
troller, and their interest in protecting the integrity of scores against strategic 
gaming. In this reading, a duty to provide actionable yet proportionate informa- 
tion follows from Art. 13(2)(f), Art. 14(2)(g) and Art. 15(1)(h) GDPR, read in 
conjunction with the other individual rights of the data subject. 


Application to Credit Scores. In the case of AI-based credit scores, such a regime
may be applied as follows. In our view, meaningful information will generally 
imply a duty to provide local explanations of individual cases, i.e., the disclosure 
of at least the most important features that contributed to the specific credit 
score of the applicant. This seems to be in line with the (non-binding) interpreta- 
tion of European privacy regulators (Article 29 Data Protection Working Party, 
2018, at 25-26). Such information is highly useful for individuals when exercising 
the mentioned rights and particularly for contesting the decision: if, for exam- 
ple, it turns out that the most important features do not seem to be related in 
any plausible way to creditworthiness or happen to be closely correlated with 
attributes protected under non-discrimination law, the data subject will be in 
a much better position to contest the decision in a substantiated way. Further- 
more, if only local information is provided, trade secrets are implicated to a much 
lesser extent than if the entire score formula was disclosed; and possibilities to 
‘game the system’ are significantly reduced. Finally, such local explanations can
increasingly be provided even for complex models, such as deep neural networks, 
without loss of accuracy [31,79]. 

On the other hand, meaningful information will generally not demand the 
disclosure of global explanations, i.e., of weights referring to the entire model. 
While this might be useful for individual complainants to detect, for example, 
whether their case represents an outlier (i.e., features were weighted differently 
in the individual case than generally in the model), the marginal benefit of a 
global explanation vis-à-vis a local explanation seems outweighed by the much
more significant impact on trade secrets and incentives to innovation if weights 
for an entire model need to be disclosed. Importantly, such a duty to provide 
global explanations would also significantly hamper the use of more complex 
models, such as deep neural networks (cf. [14, p. 162]). While such technical
limitations do not generally speak against certain interpretations of the law 
(see, e.g., BVerfG NJW 1979, 359, para. 109 - Kalkar), they seem relevant here 
because such models may, in a number of cases, perform better in the task of 
credit scoring than simpler but globally explainable models. If this premise holds, 
another provision of EU law becomes relevant. More accurate models make it possible
to better fulfill the requirements of responsible lending (see Sect. 2.3 for
details): if models more correctly predict creditworthiness, loans will be handed 
out more often only to persons who are indeed likely to repay the loan. Since 
this is a core requirement of the post-financial crisis framework of EU credit law, 
it should be taken into account in the interpretation of the GDPR in cases of 
credit scoring as well (see, for such overarching interpretations of different areas 
of EU law, CJEU, Case C-109/17, Bankia, para. 49; [38]). 

Ultimately, for local and global explanations alike, a compromise between 
information interests and trade secrets might require the disclosure of weights 
not in a highly granular, but in a ‘noisy’ fashion (e.g., providing relevance inter- 
vals instead of specific percentage numbers) [6, para. 54]. Less mathematically 
trained persons often disregard or have trouble cognitively processing probability 
information in explanations [64] so that the effective information loss for recipi- 
ents would likely be limited. Noisy weights, or simple ordinal feature ranking by 
importance, would arguably convey a measure enabling meaningful evaluation 
and critique while safeguarding more precise information relevant for the com- 
petitive advantage of the developer of the AI system, and hence for incentives 
to innovation. Such less granular information could be provided whenever the 
confidentiality of the information is not guaranteed; if the information is treated 
confidentially, for example in the framework of a specific procedure in a review 
or audit, more precise information might be provided without raising concerns 
about unfair competition. The CJEU will, of course, have the last word on these
matters. It seems not unlikely, though, that the Court would be open to
an interpretation guaranteeing actionable yet proportionate information. This 
would correspond to a welcome reading of the provisions of the GDPR with 
a view to due process and the exercise of subjective rights by data subjects 
(rights-enabling transparency). 
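
The two less granular disclosure formats mentioned above, relevance intervals and ordinal feature rankings, can be sketched as follows. This is an illustration only: the feature names and relevance values are invented, the interval width of ten percentage points is an arbitrary assumption, and the function names `ordinal_ranking` and `relevance_intervals` are hypothetical.

```python
def ordinal_ranking(relevance):
    """Feature names ordered from most to least relevant (by absolute value)."""
    return sorted(relevance, key=lambda name: abs(relevance[name]), reverse=True)

def relevance_intervals(relevance, width=10):
    """Replace each feature's percentage share by the interval that contains it."""
    total = sum(abs(value) for value in relevance.values())
    if total == 0:
        raise ValueError("all relevance values are zero")
    intervals = {}
    for name, value in relevance.items():
        share = 100.0 * abs(value) / total
        # Cap the lower bound so that a share of exactly 100 falls into the top interval.
        lower = min(int(share // width) * width, 100 - width)
        intervals[name] = (lower, lower + width)
    return intervals

# Invented local relevance values for one credit score.
relevance = {"income": 42, "payment_history": 33, "age": 17, "postcode": 8}

ranking = ordinal_ranking(relevance)
# ["income", "payment_history", "age", "postcode"]
intervals = relevance_intervals(relevance)
# {"income": (40, 50), "payment_history": (30, 40), "age": (10, 20), "postcode": (0, 10)}
```

Both outputs let a recipient see which features mattered most without revealing the exact weights.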


2.2 Contract and Tort Law: Technical and Protective Transparency


In data protection law, as the preceding section has shown, much will depend 
on the exact interpretation of the vague provisions of the GDPR, and on the 
extent to which these provisions can be applied even if humans interact with AI 
systems in more integrated forms of decision making. These limitations should 
lead us to consider incentives for actionable AI explanations in other fields of 
the law, such as contract and tort law. This involves particularly product
liability (Sect. 2.2.1), and general negligence standards under contract and tort law
(Sect. 2.2.2). Clearly, under freedom of contract, parties may generally contract 
for specific explanations that the provider of an AI system may have to enable. 
In the absence of such explicit contractual clauses, however, the question arises 
to what extent contract and tort law still compel actionable explanations. As 
we shall see, in these areas, the enabling instrumentalist variety of transparency 
(due process, exercise of rights) is to a great extent replaced by a more techni- 
cal and protective instrumentalist approach focusing on trade-offs with accuracy 
and safety. 


2.2.1 Product Liability 

In product liability law, the first persistent problem is the extent to which
it applies to non-tangible goods such as software. Article 2 of the EU Product 
Liability Directive (PLD), passed in 1985, defines a product as any movable, as 
well as electricity. While an AI system embedded in a physical component, such 
as a robot, clearly qualifies as a product under Article 2, this is highly contested 
for a standalone system such as, potentially, a credit scoring application (see 
[84,99]). In the end, at least for professionally manufactured software, one will 
have to concede that it exhibits defect risks similar to traditional products and 
entails similar difficulties for plaintiffs in proving them, which speaks strongly in 
favor of applying the PLD, at least by analogy, to such software independently 
of any embeddedness in a movable component [29, p. 43]. A proposal by the EU 
Commission on that question, and on liability for AI more generally, is expected 
for 2022. 


Design Defects. As it currently stands, the PLD addresses producers by provid- 
ing those harmed by defective products with a claim against them (Art. 1 PLD). 
There are different types of defects a product may exhibit, the most important 
in the context of AI being a design defect. With respect to the topic of this 
paper, one may therefore ask if the lack of an explanation might qualify as a 
design defect of an AI system. This chiefly depends on the interpretation of the 
concept of a design defect. 

In EU law, two rivaling interpretations exist: the consumer expectations test 
and the risk-utility test. Article 6 PLD at first glance seems to enshrine the 
former variety by holding that a ‘product is defective when it does not provide 
the safety which a person is entitled to expect’. The general problem with this 
formulation is that it is all but impossible to objectively quantify legitimate 
consumer expectations [99]. For example, would the operator of an AI system,
the affected person, or the public in general be entitled to expect explanations,
and if so, which ones? 

Product safety law is often understood to provide minimum standards in 
this respect [100, para. 33]; however, exact obligations on explainability of AI are 
lacking so far in this area, too (but see Annex I, Point 1.7.4.2.(e) of the Machinery 
Directive 2006/42 and Sect. 3). Precisely because of these uncertainties, many
scholars prefer the risk-utility test which has a long-standing tradition in US 
product liability law (see § 402A Restatement (Second) of Torts). Importantly, 
it is increasingly used in EU law as well [86][99, n. 48] and was endorsed by 
the BGH in its 2009 Airbag decision¹. Under this interpretation, a design defect
is present if the cost of a workable alternative design, in terms of development 
and potential reduced utility, is smaller than the gain in safety through this 
alternative design. Hence, the actually used product and the workable alternative 
product must be compared considering their respective utilities and their risks 
[94, p. 246].
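
As a purely illustrative formalization, not drawn from the cited case law, the comparison can be written as a single inequality. All symbols are assumptions introduced here: $R_0$ and $R_1$ denote the expected harm of the product actually used and of the workable alternative design, $C_{\text{dev}}$ the additional development cost of the alternative, and $\Delta U$ its potential loss in utility, all expressed in the same unit.

```latex
\[
  \text{design defect} \iff
  \underbrace{C_{\text{dev}} + \Delta U}_{\text{cost of the alternative design}}
  \;<\;
  \underbrace{R_0 - R_1}_{\text{gain in safety}}
\]
```

The inequality deliberately leaves out the normative reasonableness of the alternative, which courts assess separately.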

With respect to XAI, it must hence be asked if an interpretable tool would 
have provided additional safety through the explanation, and if that marginal 
benefit is not outweighed by additional costs. Such an analysis, arguably, aligns 
with a technical and protective instrumentalist conception of transparency, as a 
means to achieve safety gains. Importantly, therefore, the analysis turns not only 
on the monetary costs of adding explanations to otherwise opaque AI systems, 
but it must also consider whether risks are really reduced by the provision of an 
explanation. 

The application of the risk-utility test to explainability obligations has, to our 
knowledge, not been thoroughly discussed in the literature yet (for more general 
discussions, see [87, p. 1341, 1375][42]). Clearly, XAI may be helpful, in evidentiary
terms, for producers in showing that there was no design defect involved in 
an accident [19, p. 624][105, p. 217]; but is XAI compulsory under the test? 
The distinguishing characteristic of applying a risk-utility test to explainable AI 
seems to be that the alternative (introducing explainability) does not necessarily 
reduce risk overall: while explanations plausibly lower the risk of misapplication 
of the AI system, they might come at the expense of accuracy. Therefore, in our 
view, the following two cases must be distinguished: 


1. The explainable model exhibits the same accuracy as the original, non- 
explainable model (e.g., ex post local explanation of a DNN). In that case, 
only the expected gain in safety, from including explanations, must be weighed 
against potential costs of including explanations, such as longer run time, 
development costs, license fees etc. Importantly, as the BGH specified in its 
Airbag ruling, the alternative model need not only be factually ready for use, 
but its use must also be normatively reasonable and appropriate for the
producer². This implies that, arguably, trade secrets must be considered in the
analysis, as well. Therefore, it seems sensible to assume that, as in data
protection law, a locally (but not a globally) explainable model must be chosen,
unless the explainable add-on is unreasonably expensive. Notably, the more
actionable explanations are in the sense of delivering clear cues for operators,
or affected persons, to minimize safety risks, the stronger the argument that
such explanations indeed must be provided to prevent a design defect.

1 BGH, 16.6.2009, VI ZR 107/08, BGHZ 181, 253 para 18.
2 BGH, 16.6.2009, VI ZR 107/08, BGHZ 181, 253 para 18.

2. Matters are considerably more complicated if including explanations lowers 
the accuracy of the model (e.g., switching to a less powerful model type): in 
this case, it must first be assessed whether explanations enhance safety overall, 
by weighing potential harm from lower accuracy against potential prevention 
of harm from an increase in transparency. If risk is increased, the alterna- 
tive can be discarded. If, however, it can be reasonably expected that the 
explanations entail a risk reduction, this reduction must be weighed against 
any additional costs the inclusion of explainability features might entail, as 
in the former case (risk-utility test). Again, trade secrets and incentives for 
innovation must be accounted for, generally implying local rather than global 
explanations (if any). 
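
The two cases can be sketched as a small decision procedure. This is a minimal illustration under strong assumptions, not a statement of the legal test: the class `Design`, its fields, and the function `explanation_required` are hypothetical names, harm and cost are assumed to be expressed in the same monetary unit, any loss of utility is folded into `cost`, and normative factors such as trade secrets and incentives for innovation are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Design:
    """Hypothetical summary of one design option (all figures are illustrative)."""
    accuracy_harm: float  # expected harm from inaccurate predictions
    misuse_harm: float    # expected harm from misapplication by operators
    cost: float           # development, run time, license fees, lost utility

    @property
    def risk(self) -> float:
        # Overall expected harm of putting this design on the market.
        return self.accuracy_harm + self.misuse_harm


def explanation_required(opaque: Design, explainable: Design) -> bool:
    """Return True if the explainable alternative wins the simplified comparison."""
    risk_reduction = opaque.risk - explainable.risk
    # Case 2, first step: if explanations do not reduce risk overall
    # (accuracy losses outweigh transparency gains), discard the alternative.
    if risk_reduction <= 0:
        return False
    # Cases 1 and 2: weigh the safety gain against the additional costs.
    extra_cost = explainable.cost - opaque.cost
    return risk_reduction > extra_cost


opaque = Design(accuracy_harm=10.0, misuse_harm=8.0, cost=100.0)

# Case 1: same accuracy, ex post local explanations lower the misuse risk.
post_hoc = Design(accuracy_harm=10.0, misuse_harm=3.0, cost=102.0)
print(explanation_required(opaque, post_hoc))  # True: risk falls by 5, cost rises by 2

# Case 2: a less powerful interpretable model increases risk overall.
weaker = Design(accuracy_harm=18.0, misuse_harm=3.0, cost=100.0)
print(explanation_required(opaque, weaker))  # False: overall risk rises from 18 to 21
```

The early return mirrors the two-step structure of the second case: the cost comparison is only reached once an overall risk reduction can reasonably be expected.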


Importantly, in both cases, product liability law broadens the scope of expla- 
nations vis-a-vis data protection law. While the GDPR focuses on the data 
subject as the recipient of explanations, product liability more broadly considers 
any explanations that may provide a safety benefit, targeting therefore particu- 
larly the operators of the AI systems who determine if, how and when a system 
is put to use. Hence, under product liability law producers have to consider to 
what extent explanations may help operators safely use the AI product. 


Product Monitoring Obligations. Finally, under EU law, producers are not sub- 
ject to product monitoring obligations once the product has been put onto the 
market. However, product liability law of some Member States does contain such 
monitoring obligations (e.g., Germany³). The producers, in this setting, have to
keep an eye on the product to become aware of emerging safety risks, which is 
particularly important with respect to AI systems whose behavior might change 
after being put onto the market (e.g., via online learning). Arguably, expla- 
nations help fulfill this monitoring obligation. This, however, chiefly concerns 
explanations provided to the producer itself. If these are not shared with the 
wider public, trade secrets may be guarded; therefore, one might argue that 
even global explanations may be required. However, again, this would depend 
on the trade-off with the utility of the product as producers cannot be forced 
to put less useful products on the market unless the gain in safety, via local or
global explanations, exceeds the potentially diminished utility. 


Results. In sum, product liability law targets the producer as the responsible 
entity, but primarily focuses on explanations provided to the party controlling 
the safety risks of the AI system in the concrete application context, typically 
the operator. To the extent that national law contains product monitoring
obligations, however, explanations to the producer may have to be provided as well.
In all cases, the risk reduction facilitated by the explanations must be weighed
against the potentially reduced utility of the AI system. In this, product liability
law aligns itself with technical and protective transparency. It generates pressure
to offer AI systems with actionable explanations by targeting the supply side of
the market (producers).

3 BGH, 17.3.1981, VI ZR 286/78 - Benomyl.


2.2.2 General Negligence Standards 

Beyond product liability, general contract and tort law define duties of care that 
operators of devices, such as AI systems, need to fulfill in concrete deployment 
scenarios. Hence, they reach the demand side of the market. While contract law
covers cases in which the operator has a valid (pre-)contractual agreement with 
the harmed person (e.g., a physician with a patient; the bank with a credit 
applicant), tort law steps in if such an agreement is missing (e.g., autonomous 
lawnmower and injured pedestrian). However, the duties of care that relate to 
the necessary activities for preventing harm to the bodily integrity and the assets 
of other persons are largely equivalent under contract and tort law (see, e.g., [5, 
para 115]). In our context, this raises the question: do such duties of care require
AI to be explainable, even if any specific contractual obligations to this end are 
lacking? 


From Error Reversal to Risk-Adequate Choice. Clearly, if the operator notices 
that the AI system is bound to make or has made an error, she has to overrule 
the AI decision to avoid liability [33,42,84]. Explanations geared toward the 
operator will often help her notice such errors and make the pertinent corrections
[80, p. 23][31]. For example, explanations could suggest that the system, in the 
concrete application, weighted features in an unreasonable manner and might 
fail to make a valid prediction [71,79]. What is unclear, however, is whether the 
duty of care more generally demands explanations as a necessary precondition 
for using AI systems. 

While much will depend on the concrete case, at least generally, the duty 
of care under both contract and tort law comprises monitoring obligations for 
operators of potentially harmful devices. The idea is that those who operate and 
hence (at least partially) control the devices in a concrete case must make rea- 
sonable efforts to control the risks the devices pose to third parties (cf. [101, para. 
459]). The scope of that obligation is similar to the one in product liability, but 
not directed toward the producer, but rather the operator of the system: they 
must do whatever is factually possible and normatively reasonable and appro- 
priate to prevent harm by monitoring the system. Hence, to the extent possible 
the operator arguably has to choose, at the moment of procurement, an AI sys- 
tem that facilitates risk control. Again, this reinforces technical and protective 
transparency in the name of safety gains. If an AI system providing actionable 
explanations is available, such devices must therefore be chosen by the operator 
over non-explainable systems under the same conditions as in product liability 
law (i.e., if the explanation leads to an overall risk reduction justifying
additional costs). For example, the operator need not choose an explainable system
if the price difference to a non-explainable system constitutes an unreasonable
burden. Note, however, that the operator, if distinct from the producer, cannot 
claim that trade secrets speak against an explainable version. 


Alternative Design Obligations? Nonetheless, we would argue that the operator 
is not under an obligation to redesign the AI system, i.e., to actively install or use 
explanation techniques not provided by the producer, unless this is economically 
and technically feasible with efforts proportionate to the expected risk reduction. 
Rather, the safety obligations of the operator will typically influence the initial 
procurement of the AI system on the market. For example, if there are several 
AI-based credit scoring systems available, the operator would have to choose the
system with the best risk-utility trade-off, taking into account explainability on
both sides of the equation (potential reduction in utility and potential reduction 
of risk). Therefore, general contract and tort law sets incentives to use explain- 
able AI systems similar to product liability, but with a focus on actions by, and 
explanations for, the operator of the AI system. 


Results. The contractual and tort-law duty of care therefore does not, unlike
product liability, primarily focus on a potential alternative design of the
system, but on prudently choosing between different existing AI systems on the 
market. Interpreted in this way, general contract and tort law generate market 
pressure toward the offer of explainable systems by targeting the demand side of 
the market (operators). Like product liability, however, they cater to technical 
and protective transparency. 


2.3 Banking Law: More Technical and Protective Transparency 


Finally, banking law provides for detailed regulation governing the develop- 
ment and application of risk scoring models. It therefore represents an under- 
researched, but in fact highly relevant area of algorithmic regulation, particu- 
larly in the case of credit scoring (see, e.g., [54]). Conceptually, it is intriguing 
because the quality requirements inherent in banking law fuse technical trans- 
parency with yet another legal and economic aim: the control of systemic risk 
in the banking sector. 


2.3.1 Quality Assurance for Credit Models 

Significant regulatory experience exists in this realm because econometric and 
statistical models have long been used to predict risk in the banking sector,
such as creditworthiness of credit applicants [25]. In the wake of the financial 
crisis following the collapse of the subprime lending market, the EU legislator 
has enacted encompassing regulation addressing systemic risks stemming from 
the banking sector. Since inadequate risk models have been argued to have con- 
tributed significantly to the scope and the spread of the financial crisis [4, p. 
243-245], this area has been at the forefront of the development of internal com- 
pliance and quality regimes - which are now considered for AI regulation as 
well.
In general terms, credit institutions regulated under banking law are required
to establish robust risk monitoring and management systems (Art. 74 of Direc- 
tive 2013/36). More specifically, a number of articles in the Capital Require- 
ments Regulation 575/2013 (CRR) set out constraints for the quality assurance 
of banking scoring models. Perhaps most importantly, Article 185 CRR compels 
banks to validate the score quality (‘accuracy and consistency’) of models for 
internal rating and risk assessment, via a continuous monitoring of the function- 
ing of these models. Art. 174 CRR, in addition, specifies that: statistical models 
and ‘other mechanical methods’ for risk assessments must have good predictive 
power (lit. a); input data must be vetted for accuracy, completeness, appropri- 
ateness and representativeness (lit. b, c); models must be regularly validated (lit. 
d) and combined with human oversight (lit. e) (see [58, para. 1]; cf. [26, para. 
249]; [21, paras. 68, 256]; for similar requirements for medical products, see [84]).

These provisions foreshadow many of the requirements the AIA proposed 
by the EU Commission now seeks to install more broadly for the regulation 
of AI. However, to the extent that AI-based credit scoring is used by banks,
these provisions, unlike the AIA, already apply to the respective models.
While the responsible lending obligation contained in Article 8 of the Consumer 
Credit Directive 2008/48 only spells out generic duties to conduct creditworthi- 
ness assessments before lending decisions, Articles 174 and 185 CRR have com- 
plemented this obligation with a specific quality assurance regime. Ultimately, 
more accurate risk prediction is supposed to not only spare lenders and bor- 
rowers the transaction costs of default events, but also and perhaps even more 
importantly to rein in systemic risk in the banking sector by mitigating exposure. 
This, in turn, aims at reducing the probability of severe financial crises. 


2.3.2 Consequences for XAI 

What does this entail for explainable AI in the banking sector? While accu- 
racy (and model performance more generally) may be verified on the test data 
set in supervised learning settings without explanations relating to the relevant 
features for a prediction, explainability will, as mentioned, often be a crucial 
element for validating the generalizability of models beyond the test set (Art. 
174(d) CRR), and for enabling human review (Art. 174(e) CRR). In its inter- 
pretive guidelines for supervision and model approval, the European Banking 
Authority (EBA) therefore stipulates that banks must ‘understand the underly- 
ing models used’, particularly in the case of technology-enabled credit assessment 
tools [26, para. 53c]. More specifically, it cautions that consideration should be 
given to developing interpretable models, if necessary for appropriate use of the 
model [26, para. 53d]. 

Hence, the explainability of AI systems becomes a real compliance tool in 
the realm of banking law, an idea we shall return to in the discussion of the AIA. 
In banking law, explainability is intimately connected to the control of systemic 
risk via informed decision making of the individual actors. One might even argue 
that both local and global explainability are required under this perspective: 
local explainability helps determine accuracy in individual real-world cases for
which no ground truth is available, and global explanations contribute to the
verification of the consistency of the scoring tool across various domains and 
scenarios. As these explanations are generated internally and only shared with 
supervisory authorities, trade secrets do not stand in the way. 
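
The difference between the two kinds of explanation can be illustrated with a toy example. The following sketch assumes a hypothetical linear scoring model; the feature names, weights, baseline and applicants are invented for illustration and do not describe any real credit scoring system or any method prescribed by the CRR or the EBA. For a linear model, a feature's local contribution is its weight times the applicant's deviation from a baseline; averaging absolute contributions over many applicants gives one simple global summary.

```python
from __future__ import annotations

# Hypothetical weights of a linear credit score (illustrative only).
WEIGHTS = {"income": 0.5, "debt_ratio": -1.0, "years_employed": 0.25}


def local_explanation(applicant: dict[str, float],
                      baseline: dict[str, float]) -> dict[str, float]:
    """Contribution of each feature to one applicant's score, relative to a baseline."""
    return {f: WEIGHTS[f] * (applicant[f] - baseline[f]) for f in WEIGHTS}


def global_explanation(applicants: list[dict[str, float]],
                       baseline: dict[str, float]) -> dict[str, float]:
    """Mean absolute contribution of each feature across a set of applicants."""
    totals = {f: 0.0 for f in WEIGHTS}
    for applicant in applicants:
        for feature, contribution in local_explanation(applicant, baseline).items():
            totals[feature] += abs(contribution)
    return {f: total / len(applicants) for f, total in totals.items()}


baseline = {"income": 2.0, "debt_ratio": 0.25, "years_employed": 4.0}
first = {"income": 4.0, "debt_ratio": 0.5, "years_employed": 2.0}
second = {"income": 2.0, "debt_ratio": 0.75, "years_employed": 4.0}

# Local view: why did this particular applicant receive this score?
print(local_explanation(first, baseline))
# {'income': 1.0, 'debt_ratio': -0.25, 'years_employed': -0.5}

# Global view: which features drive the scores across the portfolio?
print(global_explanation([first, second], baseline))
# {'income': 0.5, 'debt_ratio': 0.375, 'years_employed': 0.25}
```

A reviewer could use the local view to check whether an individual score rests on plausible features, and the global view to check whether the tool behaves consistently across applicants. For non-linear models such as DNNs, these quantities would have to be approximated by dedicated explanation techniques.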

The key limitation of these provisions is that they apply only to banks in 
the sense of banking law (operating under a banking license), but not to other 
institutions not directly subject to banking regulation, such as mere credit rat- 
ing agencies [7]. Nevertheless, the compliance and quality assurance provisions 
of banking law seem to have served as a blueprint for current AI regulation
proposals such as the EU Artificial Intelligence Act (esp. Art. 9, 14, 15 and 17), 
to which we now turn. 


3 Regulatory Proposals at the EU Level: The AIA 


The AIA, proposed by the EU Commission in April 2021, is set to become a cor- 
nerstone of AI regulation not only in the EU, but potentially with repercussions 
on a global level. Most notably, it subscribes to a risk-based approach and there- 
fore categorically differentiates between several risk categories for AI. Figure 2 
offers a snapshot of the connections between the functions of transparency and 
various Articles of the AIA.

Fig. 2. Overview of the functions of different Articles of the AIA transparency provisions


3.1 AI with Limited Risk: Decision-Enabling Transparency (Art. 52 AIA)?


For specific AI applications with limited risk, Article 52 AIA spells out
transparency provisions in an enabling but highly constrained spirit (see also [38,95]).
Thus, the providers of AI systems interacting with humans, of emotion recognition
systems, biometric categorization systems and of certain AI systems meant
to manipulate images, audio recordings or videos (e.g., deep fakes) need to dis- 
close the fact that an AI system is operating and, in the last case, that content 
was manipulated. Transparency, in this sense, does not relate to the inner work- 
ings of the respective AI systems, but merely to their factual use and effects. 

The aim of these rules arguably is also of an enabling nature, but primarily 
with respect to informed choice, or rather informed avoidance (decision-enabling 
transparency), not the exercise of rights. Whether these rules will have any mean- 
ingful informational and behavioral effect on affected persons, however, must at 
least be doubted. A host of studies document rational as well as boundedly 
rational ignorance of standard disclosures in digital environments [1,13,72]. But
regardless of the individual benefit, the more or less complete information about 
the use of low-risk AI systems alone is indirectly helpful in providing overviews 
and insights to civil society initiatives or journalistic projects, for example. More- 
over, in the specific case of highly controversial AI applications such as emotion 
recognition or remote biometric identification, compulsory disclosure might, via 
coverage by media and watchdogs, engender negative reputational effects for the 
providers, which may lead some of them to reconsider the use of such systems 
in the first place. 


3.2 AI with High Risk: Encompassing Transparency (Art. 13 AIA)? 


The regulatory environment envisioned by the AIA is strikingly different for 
high-risk AI applications. Such applications are supposed to be defined via a 
regularly updated Annex to the AIA and, according to the current proposal, 
comprise a wide variety of deployment scenarios, from remote biometric identifi- 
cation to employment and credit scoring contexts, and from the management of 
critical infrastructure to migration and law enforcement (see Annex III AIA). In 
this regard, the question of the process of updating the AIA Annex is still open 
in terms of participation and public consultation. The requirement for low-risk
AI systems to at least document the use and effects of the selected technologies, 
however, leads us to expect case-related disputes about whether an AI applica- 
tion should be classified as high risk, in which stakeholder representatives, civil 
and human rights protection initiatives, and manufacturers and users of tech- 
nologies will wrestle with each other. This public struggle can also be seen as a 
rights-enabling transparency measure. 


3.2.1 Compliance-Oriented Transparency

For such high-risk applications, Article 13 AIA spells out a novel transparency
regime that might be interpreted as seeking to fuse, to varying degrees, the 
several instrumentalist approaches identified in this paper, while notably fore- 
grounding another goal of transparency: legal compliance. 

Hence, Article 13(1) AIA mandates that high-risk AI systems be ‘sufficiently
transparent to enable users to interpret the system’s output and use it
appropriately’. In this, an ‘appropriate type and degree of transparency’ must be
ensured. The provision therefore acknowledges the fundamentally different
varieties of explanations that could be provided for AI systems, such as local, global
or counterfactual explanations; or more or less granular information on feature 
weights. The exact scope and depth of the required transparency is further elab- 
orated upon in Article 13(3) AIA and will need to be determined in a context- 
specific manner. Nothing in the wording of Article 13, however, suggests that 
global explanations, which may be problematic for complex AI systems, must 
be provided on a standard basis. However, explanations must be faithful to 
the model in the sense that they need to be an, at least approximately, cor- 
rect reconstruction of the internal decision making parameters: explanation and 
explanandum need to match [57]. For example, local ex post explanations would 
have to verifiably and, within constraints, accurately measure feature relevance 
(or other aspects) of the used model. 

Notably, with respect to the general goal of transparency, the additional 
explanatory language in Article 13(1) AIA introduces a specific and arguably 
novel variety of transparency instrumentalism geared toward effective and com- 
pliant application of AI systems in concrete settings. In fact, Article 13(1) AIA 
defines a particular and narrow objective for appropriate transparency under 
the AIA: facilitating the fulfillment of the obligations providers and users have 
under the AIA itself (Chap. 3, i.e., Art. 16-29). Most notably, any reference to
rights of users or affected persons is lacking; rather, Article 29 AIA specifies 
that users may only deploy the AI system within the range of intended purposes 
specified by the provider and disclosed under Article 13(2) AIA. Hence, trans- 
parency under the AIA seems primarily directed toward compliance with the 
AIA itself, and not towards the exercise of rights affected persons might have. 
In this sense, the AIA establishes a novel, self-referential, compliance-oriented 
type of transparency instrumentalism. 


3.2.2 Restricted Forms of Enabling and Protective Transparency

For specific applications, the recitals, however, go beyond this restrained
compliance conception and hold that, for example in the context of law enforcement,
transparency must facilitate the exercise of fundamental rights, such as the right 
to an effective remedy or a fair trial (Recital 38 AIA). This points to a more 
encompassing rights-enabling approach, receptive of demands for contestability, 
which stands in notable tension, however, with the narrower, compliance-oriented 
wording of Article 13(1) AIA. To a certain extent, however, the information 
provided under Article 13 AIA will facilitate audits by supervisory authorities, 
collective redress organizations or NGOs (‘review-enabling transparency’). 
Furthermore, the list of specific items that need to be disclosed under Article 
13(3) AIA connects to technical and protective instrumentalist conceptions of 
transparency (see also [41]). Hence, Article 15 AIA mandates appropriate levels 
of accuracy, as well as robustness and cybersecurity, for high-risk AI systems. 
According to Article 13(3)(b)(ii) AIA, the respective metrics and values need 
to be disclosed. In this, the AIA follows the reviewed provisions of banking law 
in installing a quality assurance regime for AI models whose main results need
to be disclosed. As mentioned, this also facilitates legal review: if the disclosed
performance metrics suggest a violation of the requirements of Article 15 AIA, 
the supervisory authority may exercise its investigative and corrective powers. 
The institutional layout of this oversight and supervisory regime, however, is still
not fully defined: the sectoral differentiation of AI applications in the AIA’s risk
definitions, on the one hand, suggests an equally sectoral organization of
supervisory authorities; the technical and procedural expertise needed for such oversight
procedures, on the other hand, calls for a less distributed supervisory regime.

Similarly, Article 10 AIA installs a governance regime for AI training data, 
whose main parameters, to the extent relevant for the intended purpose, also 
need to be divulged (Art. 13(3)(b)(v) AIA). Any other functionally relevant
limitations and predetermined changes must additionally be disclosed (Art.
13(3)(b)(iii), (iv), (c) and (e)). Finally, disclosure also extends to human
oversight mechanisms required under Article 14 AIA, which are, like the governance
of training data, another transplant from the reviewed provisions on models in banking law.
Such disclosures, arguably, cater to protective transparency as they seek to guard 
against use of the AI system beyond its intended purpose, its validated perfor- 
mance or in disrespect of other risk-minimizing measures. 

Hence, transparency under Article 13 is intimately linked to the require- 
ments of human oversight specified in Article 14 AIA. That provision establishes 
another important level of protective transparency: high-risk AI applications 
need to be equipped with interface tools enabling effective oversight by human 
persons to minimize risks to health, safety and fundamental rights. Again, as dis- 
cussed in the contract/tort and banking law sections, local explanations partic- 
ularly facilitate monitoring and the detection of inappropriate use or anomalies 
engendering such risks (cf. Art. 14(4)(a) AIA). While it remains a challenge to 
implement effective human oversight in AI systems making live decisions (e.g., 
in autonomous vehicles), the requirement reinforces the focus of the AIA on 
transparency vis-a-vis professional operators, not affected persons. 


3.3 Limitations 


The transparency provisions in the AIA in several ways represent steps in the 
right direction. For example, they apply, unlike the GDPR rules reviewed,
irrespective of whether decision making is automated or not and of whether 
personal data is processed or not. Furthermore, the inclusion of a quality assur- 
ance regime should be welcomed and even be (at least partially) expanded to 
non-high-risk applications, as disclosure of pertinent performance metrics may 
be of substantial signaling value for experts and the market. Importantly, the 
rules of the future AIA (and of the proposed Machinery Regulation) will likely 
at least generally constitute minimum thresholds for the avoidance of design 
defects in product liability law (see Sect. 2.2.1), enabling decentralized private 
enforcement next to the public enforcement foreseen in the AIA. Nonetheless, 
the transparency provisions of the AIA are subject to significant limitations.

First and foremost, self-referential compliance and protective transparency
seem to detract from meaningful rights-enabling transparency for affected
persons. Notably, the transparency provisions of Article 13 AIA are geared
exclusively toward the users of the system, with the latter being defined in Article
3(4) AIA as anyone using the system with the exception of consumers. While
this restriction has the beneficial effect of sparing consumers obligations and
liability under the AIA (cf. [102]), for example under Article 29 AIA, it has the
perhaps unintended and certainly significant effect of excluding non-professional 
users from the range of addressees of explanations and disclosure [27,91]. There- 
fore, the enabling variety of transparency, invoked in lofty words in Recital 38 
AIA, is missing from the Articles of the AIA and will in practice be largely rele- 
gated to other, already existing legal acts - such as the transparency provisions 
of the GDPR reviewed above. In this sense, the AIA does not make any sig- 
nificant contribution to extending or sharpening the content of the requirement 
to provide ‘meaningful information’ to data subjects under the GDPR. In this 
context, information facilitating a review in terms of potential bias with respect 
to protected groups is missing, too. 

Second, this focus on professional users and presumed experts continues in 
the long list of items to be disclosed under Article 13(3) AIA. While performance 
metrics, specifications about training data and other disclosures do provide rel- 
evant information to sophisticated users to determine whether the AI system 
might present a good fit to the desired application, such information will only 
rarely be understandable and actionable for users without at least a minimal 
training in ML development or practice. In this sense, transparency under the 
AIA might be described as transparency ‘by experts for experts’, likely lead- 
ing to information overload for non-experts. The only exception in this sense 
is the very reduced, potentially decision-enabling transparency obligation under 
Article 52 AIA.

Third, despite the centrality of transparency for trustworthy AI in the com- 
munications of the EU Commission (see, e.g., European Commission, 2020), the 
AIA contains little incentive to actually disclose information about the inner 
workings of an AI system to the extent that they are relevant and actionable 
for affected persons. Most of the disclosure obligations refer either to the mere 
fact that an AI system of a specific type is used (Art. 52 AIA) or to descrip- 
tions of technical features and metrics (Art. 13(3) AIA). Returning briefly to 
the example of credit scoring, the only provision potentially impacting the ques- 
tion of whether local or even global explanations of the scores (feature weights) 
are compulsory is the first sentence of Article 13(1) AIA. According to it, users 
(i.e., professionals at the bank or credit scoring agency) must be able to inter- 
pret the system’s output. The immediate reference, in the following sentence, to 
the obligations of users under Article 29 AIA, however, detracts from a reading 
that would engage Article 13 AIA to provide incentives for clear and actionable 
explanations beyond what is already contained in Articles 13-15 GDPR. The 
only interpretation potentially suggesting local, or even global, explanations is 
the connection to Article 29(4) AIA. Under this provision, users have to monitor 
the system to decide whether use according to the instructions may nonetheless 
lead to significant risks. One could argue that local explanations could be con-
ducive to and perhaps even necessary for this undertaking to the extent that
they enable professional users to determine if the main features used for the pre- 
diction were at least plausibly related to the target, or likely rather an artifact 
of the restrictions of training, e.g., of overfitting on training data (cf. [79]). Note, 
however, that for credit institutions regulated under banking law, the specific 
provisions of banking law take precedence over Article 29(4) and (5) AIA. 

Fourth, while AI systems used by banks will undergo a conformity assessment
as part of the supervisory review and evaluation process already in place for 
banking models (Art. 43(2)(2) AIA), the providers of the vast majority of high- 
risk AI systems will be able to self-certify the fulfilment of the criteria listed 
in the AIA, including the transparency provisions in Art. 13 (see Art. 43(2)(1) 
AIA). The preponderance of such self-assessment may result from an endeavor to 
exonerate regulatory agencies and to limit the regulatory burden for providers, 
but it clearly reduces enforcement pressure and invites sub-optimal compliance 
with the already vague and limited transparency provisions (cf. also [91,95]). 

In sum, the AIA provides for a plethora of information relevant for sophisti-
cated users, in line with technical transparency, but will disappoint those that
had hoped for more guidance on and incentives for meaningful explanations 
enabling affected persons to review and contest the output of AI systems. 


4 Beyond Explainability 


As the legal overview has shown, different areas of law embody different concep- 
tions of AI explainability. Perhaps most importantly, however, if explanations 
are viewed as a social act enabling a dialogical exchange and laying the basis for 
goal-oriented actions of the respective recipients, it will often not be sufficient to 
just provide them with laundry lists of features, weights or model architectures. 
There is a certain risk that the current drive toward explainable AI, particularly 
if increasingly legally mandated, generates information that does not justify the 
transaction costs it engenders. Hence, computer science and the law have to go 
beyond mere explainability toward interactions that enable meaningful agency 
of the respective recipients [103], individually, but even more so by strengthening 
the ability of stakeholder organizations or civil and human rights organizations. 
This includes a push for actionable explanations, but also for connections to 
algorithmic fairness, to quality benchmarking and to co-design strategies in an 
attempt to construct responsible, trustworthy AI [3,45]. 


4.1 Actionable Explanations 


The first desideratum, therefore, is for explanations to convey actionable infor- 
mation, as was stressed throughout the article. Otherwise, for compliance reasons 
and particularly under the provisions of the AIA, explanations might be provided 
that few actors actually cognitively process and act upon. This implies a shift 
from a focus on the technical feasibility of explanations toward, with at least 
equal importance, the recipient-oriented design of the respective explanations.


4.1.1 Cognitive Optimization

Generally, to be actionable, explanations must be designed such that informa- 
tion overload is avoided, keeping recipients with different processing capabilities 
in mind. This is a lesson that can be learned from decades of experience with 
the disclosure paradigm in US and EU consumer law: most information is flatly 
ignored by consumers [8,72]. To stand a chance of being cognitively processed, 
the design of explanations must thus be recipient-oriented. In this, a rich litera- 
ture on enhancing the effectiveness of privacy policies and standard information 
in consumer and capital markets law can be exploited [10,64]. Information, in 
this sense, must be cognitively optimized for the respective recipients, and the 
law, or at least the implementing guidelines, should include rules to this effect. 

To work, explanations likely must be salient and simple [93] and include 
visualizations [48]. Empirical studies indeed show that addressees prefer sim- 
ple explanations [78]. Furthermore, when more complex decisions need to be 
explained, information could be staggered by degree of complexity. Research 
on privacy policies, for example, suggests that multi-layered information may 
bridge the gap between diverging processing capacities of different actors [83]. 
Hence, simple and concise explanations could be given first, with more detailed, 
expert-oriented explanations provided on a secondary level upon demand. For 
investment information, this has already been implemented with the mandate 
on a Key Information Document in EU Regulation 1286/2014 (PRIIPs Regulation)
(see also [54, p.540]). Finally, empirical research again shows that actionable 
explanations tend to be contrastive, a concept increasingly explored in AI expla- 
nations as well [64,65]. 

Hence, there are no one-size-fits-all explanations; rather, they need to be 
adapted to different contexts and addressees. What the now classic literature on 
privacy policies suggests is that providing information is only one element of a 
more general privacy awareness and privacy-by-design strategy [44] that takes 
different addressees, practical needs and usable tools into account: A browser- 
plugin notifying about ill-defined or non-standard privacy settings can be more 
helpful for individual consumers than a detailed and descriptive walk-through 
of specific privacy settings. A machine-readable and standardized format for 
reviewing and monitoring privacy settings, however, is helpful for more technical 
reviews by privacy advocacy organizations. The ‘ability to respond’ to different 
contexts and addressees therefore is a promising path towards ‘response-able’ 
[51] AI. One particular strategy might be to let affected persons choose foils 
(within reasonable constraints) and generate contrastive explanations bridging 
the gap between fact and foil. 


4.1.2 Goal Orientation 

Beyond these general observations for cognitive optimization, actionable expla- 
nations should be clearly linked to the respective goals of the explanations. If the 
objective is to enable an understanding of the decision by affected persons and 
to permit the exercise of rights or meaningful review (rights- or review-enabling 
transparency), shortlists of the most relevant features for the decision ought to
be required [79][12, for limitations]. This facilitates, inter alia, checks for plau-
sibility and discrimination. Importantly, such requirements have, in some areas,
already been introduced into EU law by recent updates of consumer and business 
law. Under the new Art. 6a of the Consumer Rights Directive and the new Art. 
7(4a) of the Unfair Commercial Practices Directive, online marketplaces will 
shortly need to disclose the main parameters for any ranking following a search 
query, and their relative importance. Art. 5 of the P2B Regulation 2019/1150 
equally compels online intermediaries and search engines to disclose the main 
parameters of ranking and their relative importance. However, these provisions 
require global, not local explanations [37, p.52][14, p.161]. 

This not only generates technical difficulties for more complex AI systems, 
but the risk that consumers will flatly ignore such global explanations is arguably 
quite high. Rather, in our view, actionable information should focus on local 
explanations for individual decisions. Such information not only seems to be 
technically easier to provide, but it is arguably more relevant, particularly for 
the exercise of individual rights. From a review-enabling perspective, local infor- 
mation could be relevant as well for NGOs, collective redress organizations and 
supervisory authorities seeking to prosecute individual rights violations. In this 
sense, a collective dimension of individual transparency emerges (cf. also [46]). 
On the downside, however, local feature relevance information may produce a 
misleading illusion of simplicity; in non-linear models, even small input changes 
may alter principal reason lists entirely [12,57]. 
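
The instability of reason lists in non-linear models can be illustrated with a toy model. The following is a minimal sketch and not a method taken from the cited works: the scoring function `score`, its two features and the use of gradient-times-input as a local relevance measure are all assumptions made purely for illustration.

```python
# Hypothetical non-linear scoring function with an interaction term.
# Both features are assumed to be scaled to [0, 1]; nothing here is
# taken from a real credit scoring model.
def score(income, debt):
    return income * (1.0 - debt) + 0.5 * debt


def local_relevance(income, debt, eps=1e-6):
    """Gradient-times-input relevance of each feature at one input point.

    The gradient is approximated by forward finite differences.
    """
    base = score(income, debt)
    grad_income = (score(income + eps, debt) - base) / eps
    grad_debt = (score(income, debt + eps) - base) / eps
    return {"income": grad_income * income, "debt": grad_debt * debt}


def principal_reason(income, debt):
    """Name of the feature with the largest absolute local relevance."""
    relevance = local_relevance(income, debt)
    return max(relevance, key=lambda name: abs(relevance[name]))


# Two hypothetical applicants who differ only slightly in their debt ratio.
reason_a = principal_reason(income=0.9, debt=0.68)
reason_b = principal_reason(income=0.9, debt=0.71)
print(reason_a, reason_b)
```

In this sketch, a change of 0.03 in a single input swaps the top-ranked reason from income to debt, although the score itself hardly moves. A shortlist of principal reasons would therefore look very different for two nearly identical applicants.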

If, therefore, the goal is not to review or challenge the decision, but to facili- 
tate market decisions and particularly to create spaces for behavioral change of 
affected persons (decision-enabling transparency), for example to improve their 
credit score, counterfactual or contrastive information might serve the purpose 
better [65,97]. In the example of credit scoring, this could set applicants toward 
the path of credit approval. Such information could be problematic, however, if 
the identified features merely correlate with creditworthiness, but are not causal 
for it. In this case, the risk of applicants trying to ‘game the system’ by arti-
ficially altering non-causal features is significant (e.g., putting felt tips under
furniture as predictors of creditworthiness [85, p.71]). Moreover, in highly dimen- 
sional systems with many features, many counterfactuals are possible, making 
it difficult to choose the most relevant one for the affected person [97, p.851]. 
In addition, some counterfactually relevant features may be hard or impossible 
to change (e.g., age, residence) [50]. In these cases, local shortlists of the most 
relevant features [79] or minimal intervention advice [50] might be more helpful. 
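
How immutable features narrow the set of useful counterfactuals can be sketched for the simplest case. The example below assumes a hypothetical linear credit score with an approval threshold; the feature names, weights and the helper `single_feature_counterfactuals` are illustrative assumptions, not part of any cited method. It also says nothing about whether a feature is causal for creditworthiness, which is the separate concern raised above.

```python
def single_feature_counterfactuals(weights, bias, applicant, threshold, immutable):
    """For a linear score w.x + b, list the change to each mutable feature
    that would, on its own, lift the score exactly to the threshold."""
    current = bias + sum(weights[f] * applicant[f] for f in weights)
    gap = threshold - current
    if gap <= 0:
        return []  # already at or above the threshold, nothing to change
    options = []
    for feature, weight in weights.items():
        if feature in immutable or weight == 0:
            continue  # cannot be changed, or has no effect on the score
        options.append((feature, gap / weight))
    # Smallest absolute change first; this assumes comparable feature scales.
    return sorted(options, key=lambda item: abs(item[1]))


# Hypothetical model and applicant (all values are made up).
weights = {"income": 0.5, "open_debts": -0.25, "age": 0.25}
applicant = {"income": 4.0, "open_debts": 2.0, "age": 3.0}

advice = single_feature_counterfactuals(
    weights, bias=0.0, applicant=applicant, threshold=2.5, immutable={"age"}
)
print(advice)
```

Even in this three-feature toy case there are two competing counterfactuals once age is excluded, and the ranking depends on an assumed cost measure (here, the absolute size of the change). With many features the number of candidates grows accordingly.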

Overall, research for the type of explanation with the best fit for each context 
will have to continue; it will benefit from cross-fertilization with social science 
research on the effectiveness of information more generally and explanations 
more particularly [64] as well as with research in science & technology studies 
on organizational, institutional and cultural contextualization of decision sup- 
port, explanations, and accountability. Ultimately, a context-dependent, goal- 
oriented mix of explanations (e.g., relevance shortlist combined with counter- 
factual explanation) might best serve the various purposes explanations have
to fulfil in concrete settings. In this, a critical perspective drawing on the limi-
tations of the disclosure paradigm in EU market law (see, e.g., [11,39]) should
be helpful to prevent information overload and to limit disclosure obligations to 
what is meaningfully oriented to the respective goals of the explanations. 


4.2 Connections to Algorithmic Fairness 


Transparency, and explanations such as disclosure of the most relevant features 
of an AI output, may serve yet another goal: non-discrimination in algorithmic 
decision making. A vast literature deals with tools and metrics to implement 
non-discrimination principles at the level of AI models to facilitate legal compli- 
ance [52,76,106]. Explanations may reinforce such strategies by facilitating bias 
detection and prevention, both by affected persons and review institutions. For 
example, in the case of credit scoring, disclosure of the most important features 
(local explanations) could help affected persons determine to what extent the 
decision might have been driven by variables closely correlated with protected 
attributes [3]. Such cross-fertilization between bias detection and explanations 
could be termed ‘fairness-enabling transparency’ and should constitute a major 
research goal from a legal and technical perspective. 

In a similar vein, Sandra Wachter and colleagues have convincingly advo- 
cated for the disclosure of summary statistics showing the distribution of scores 
between different protected groups [98]. As one of the authors of this contribu- 
tion has argued, such disclosures might in fact already be owed under the cur- 
rent GDPR disclosure regime (Art. 13(2)(f), Art. 14(2)(g), Art. 15(1)(h) GDPR: 
information about the ‘significance and envisaged consequences’ of processing, 
see [40, p.1173-1174]). In addition, Art. 13(3)(b)(iv) AIA proposes the disclo- 
sure of a high-risk AI system’s ‘performance as regards the persons or groups of 
persons on which the system is intended to be used’. While one could interpret
this as a mandate for differential statistics concerning protected groups, such an 
understanding is unlikely to prevail, in the current version of the AIA, as a ref- 
erence to protected attributes in the sense of antidiscrimination law is patently 
lacking. Fairness-enabling transparency, such as summary statistics showing dis- 
tributions between protected groups, to the extent available, thus constitutes an 
area that should be included in the final version of the AIA. 
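
What such summary statistics might look like can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the record layout (`group`, `score`), the group labels and the choice of statistics are hypothetical and are not prescribed by the GDPR, the AIA or the cited proposal.

```python
from statistics import mean, median


def score_summary_by_group(records, group_key="group", score_key="score"):
    """Summarise the distribution of scores separately for each group."""
    scores_by_group = {}
    for record in records:
        scores_by_group.setdefault(record[group_key], []).append(record[score_key])
    return {
        group: {
            "n": len(scores),
            "mean": mean(scores),
            "median": median(scores),
            "min": min(scores),
            "max": max(scores),
        }
        for group, scores in scores_by_group.items()
    }


# Made-up scores for two hypothetical groups A and B.
records = [
    {"group": "A", "score": 600},
    {"group": "A", "score": 700},
    {"group": "A", "score": 800},
    {"group": "B", "score": 500},
    {"group": "B", "score": 650},
]
summary = score_summary_by_group(records)
print(summary)
```

A table of this kind does not establish discrimination by itself, but it gives affected persons and review institutions a starting point for asking why the distributions differ.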


4.3 Quality Benchmarking 


Finally, technical and protective transparency closely relates to (the disclosure 
of) quality standards for AI systems. These metrics, in turn, also enable regula- 
tory review and are particularly important, as seen, in banking law [54, p.561- 
563]. Two aspects seem to stand out at the intersection of explanations and 
quality benchmarking: 

First, an absolute quality control, such as the one installed in Art. 174/185 
CRR, could be enshrined for all AI applications, at least in medium- and high- 
stakes settings (transcending the ultimately binary logic of the AIA with respect 
to risk classification). In these settings, quality assurance might be considered as
important as, or even more important than, mere explainability. Quality control
would include, but not be limited to, explanations facilitating decisions about the 
generalizability of the model (e.g., local explanations). Importantly, the disclo- 
sure of performance metrics would also spur workable competition by enabling 
meaningful comparison between different AI systems. Notably, relevant qual- 
ity assurance provisions in the AIA (Art. 10/15 AIA) are limited to high-risk 
applications. An update of the AIA might draw inspiration from banking law in 
working toward a quality assurance regime for algorithmic decision making in 
which the monitoring of field performance and the assessment of the generaliz- 
ability of the model via explainability form an important regulatory constraint 
not only for high-risk but also for medium-risk applications, at least. 

Second, understanding the risks and benefits of, and generating trust in, AI 
systems should be facilitated by testing the quality of AI models against the 
benchmark of traditional (non-AI-based) methods (relative quality control). For 
example, a US regulator, the Consumer Financial Protection Bureau, ordered a 
credit scoring startup working with alternative data to provide such an analysis. 
The results were promising: according to the analysis, AI-based credit scoring
was able to deliver cheaper credit and improved access, both generally and with 
respect to many different consumer subgroups [30][35, p.42]. To the extent that 
the analysis is correct, it shows that AI, if implemented properly and monitored 
rigorously, may provide palpable benefits not only to companies using it, but to 
consumers and affected persons as well. Communicating such benefits by bench- 
marking reports seems a sensible way to enable more informed market decisions, 
to facilitate review and to generate trust - strengthening three important pillars 
of any explainability regime for AI systems. 


4.4 Interventions and Co-design 


Such ways of going beyond the already existing and currently proposed forms of 
transparency obligations by developing formats and methods to produce action- 
able explanations, by connecting transparency and explainability issues to ques- 
tions of algorithmic fairness and new or advanced forms of quality benchmarking 
and control are, as favorable as they are, mainly ex post mechanisms aiming at 
helping affected persons, users, NGOs or supervisory authorities to evaluate and 
act upon the outcomes of AI systems in use. They can inform market decisions, 
help affected persons to claim rights or enable regular oversight and supervi- 
sion, but they do not intervene in the design and implementation of complex 
AI systems. Linking to two distinct developments of inter- and transdisciplinary
research can help to further develop forms of intervention and co-design: 

First, methods and formats for ‘values-in-design’ [53,70] projects have been 
developed in other areas of software engineering, specifically in human-computer
interaction (HCI) and computer-supported cooperative work (CSCW) setups that
traditionally deal with heterogeneous user groups as well as with a diverse set of
organizational and contextual requirements due to the less domain-specific areas 
of application of these software systems (see [32] for an overview). Formats and 
methods include the use of software engineering artifacts to make normative
requirements visible and traceable or the involvement of affected persons, stake-
holders, or spokespersons in requirements engineering, evaluation and testing
[32,75]. Technical transparency as discussed above can support the transfer and 
application of such formats and methods to the co-design of AI systems [2] with 
global explanations structuring the process and local explanations supporting 
concrete co-design practices. 

Second, these methodological advances have been significantly generalized 
and advanced under the 2014-2020 Horizon 2020 funding scheme, moving from 
‘co-design to ELSI co-design’ [56] and leading to further developing tools, meth- 
ods and approaches designed for research on SwafS (‘Science with and for Soci- 
ety’) into a larger framework for RRI (‘Responsible Research and Innovation’) 
[28]. In AI research, specifically in projects aiming to improve accountabil- 
ity or transparency, a similar, but still quite disconnected movement towards 
‘Responsible AI’ [24] has gained momentum, tackling very similar questions of 
stakeholder integration, formats for expert/non-expert collaboration, domain- 
knowledge evaluation or contestation and reversibility that have been discussed 
within the RRI framework with a focus on energy technologies, biotechnolo- 
gies or genetic engineering. This is a rich resource to harvest for further steps 
towards XAI by adding addressee orientation, contestability criteria or even, 
reflexively, tools to co-design explanations through inter- and transdisciplinary 
research [63,69]. 


5 Conclusion 


This paper has sought to show that the law, to varying degrees, mandates or 
incentivizes different varieties of AI explanations. These varieties can be dis- 
tinguished based on their respective functions or goals. When affected persons 
are the addressees, explanations should be primarily rights-enabling or decision- 
enabling. Explanations for operators or producers, in turn, will typically facil- 
itate technical improvements and functional review, fostering the mitigation of 
legally relevant risks. Finally, explanations may enable legal review if perceived 
by third parties, such as NGOs, collective redress organizations or supervisory
authorities. 

The GDPR, arguably, subscribes to a rights-enabling transparency regime 
under which local explanations may, depending on the context, have to be 
provided to individual affected persons. Contract and tort law, by contrast, 
strive for technical and protective transparency under which the potential trade- 
off between performance and explainability takes center stage: any potentially 
reduced accuracy or utility stemming from enforcing explanations must be 
weighed against the potential safety gains such explanations enable. Explana- 
tions are required only to the extent that this balance is positive. Banking law, 
finally, endorses a quality assurance regime in which transparency contributes to 
the control of systemic risk in the banking sector. Here, even global explanations 
may be required. The proposal for the AIA, in turn, is primarily geared toward 
compliance-oriented transparency for professional operators of AI systems. From 
a rights-enabling perspective, this is a significant limitation.

These legal requirements, however, can be interpreted to increasingly call
for actionable explanations. This implies moving beyond mere laundry lists of 
relevant features toward cognitively optimized and goal-oriented explanations. 
Multi-layered or contrastive explanations are important elements in such a strat- 
egy. Tools, methods and formats from various values-in-design approaches as well 
as those developed under the umbrella term of ‘responsible research and inno-
vation’ can help to co-design such systems and explanations.

Finally, an update of the AIA should consider fairness-enabling transparency, 
which seeks to facilitate the detection of potential bias in AI systems, as well 
as broader provisions for quality benchmarking to facilitate informed decisions 
by affected persons, to enable critical review and the exercise of rights, and to 
generate trust in AI systems more generally.


Abstract. AI explainability is becoming indispensable for allowing users to
gain insights into an AI system’s decision-making process. Meanwhile,
fairness is another rising concern: algorithmic predictions may be
misaligned with the designer’s intent or with social expectations, for
example by discriminating against specific groups. In this work, we provide
a state-of-the-art overview of the relations between explanation and AI
fairness, and especially of the role of explanations in humans’ fairness
judgements. The investigations demonstrate that fair decision making
requires extensive contextual understanding, and that AI explanations help
identify the variables that potentially drive unfair outcomes. Different
types of AI explanations are found to affect humans’ fairness judgements
differently. Certain properties of features, as well as social science
theories, need to be considered when making sense of fairness with
explanations. Several challenges are identified for building responsible AI
for trustworthy decision making from the perspective of explainability and
fairness.


Keywords: Fairness · Explainable AI · Explainability · Machine
learning


1 Introduction 


Artificial Intelligence (AI), including Machine Learning (ML) algorithms, is
increasingly shaping people’s daily lives by making decisions with ethical and
legal impacts in various domains such as banking, insurance, medical care,
criminal justice, predictive policing, and hiring [43,44]. While AI-informed
decision making can lead to faster and better decision outcomes, AI algorithms
such as deep learning often use complex learning approaches, and even their
designers are often unable to understand why the AI arrived at a specific
decision. AI therefore remains a black box that makes it hard for users to
understand why a decision is made or how the data is processed for the
decision making [8,44,45]. Because of this black-box nature, the deployment of
AI algorithms, especially in high-stakes domains, usually requires testing and
verification for reasonability by domain experts, not only for safety but also
for legal reasons [35]. Users also want to understand the reasons behind
specific AI-informed decisions. For example, high-stakes domains require
explanations of AI before any critical decision is taken, computer scientists
use explanations to refine and further improve the performance of AI
algorithms, and AI explanations can also improve the user experience of a
product or service by helping end-users trust that the AI is making good
decisions [7]. As a result, AI explanation has experienced a significant surge
in interest from the international research community across application
domains ranging from agriculture to human health, and it is becoming
indispensable in addressing ethical concerns and fostering trust and
confidence in AI systems [20,42,43].

Furthermore, AI algorithms are often trained on a large amount of historical
data, and may thereby not only replicate but also amplify existing biases or
discrimination in that data. Owing to such biased input data or to faulty
algorithms, unfair AI-informed decision making systems have been shown to
systematically reinforce discrimination, such as racial or gender biases.
This drives distrust in, and fear of, the use of AI in public discussions
[41].

In addition, the wide use of AI in almost every aspect of our lives implies
that with great power comes great responsibility. Fairness means that an AI
system exhibits certain desirable ethical characteristics, such as being
bias-free, diversity-aware, and non-discriminatory, while explanations of an
AI system provide human-understandable interpretations of the inner workings
of the system and of its decisions. Both fairness and explanation are
important components for building “Responsible AI”. For example, fair
treatment and/or fair outcomes are important ethical issues that need to be
considered in algorithmic hiring decisions. Explaining the decisions made by
an algorithmic process in a transparent and compliant way is likewise
necessary for the ethical use of AI in hiring [36]. Therefore, both fairness
and explanations are important ethical issues that can be used to promote
user trust in AI-informed decision making (see Fig. 1).


[Diagram with the nodes “AI Fairness” and “AI Explanation”]


Fig. 1. Relations among AI fairness, AI explanation, and trust. 


Previous research found that AI explanations not only help humans understand
the AI system, but also provide an interface for the human in the loop,
enabling them to identify and address fairness and other issues [12].
Furthermore, differences in AI outcomes amongst different groups in
AI-informed decision making can, in some cases, be justified and explained
via different attributes [27]. When these differences are justified and
explained, the discrimination is not considered to be illegal [22].
Therefore, explanation and fairness are closely related in AI-informed
decision making (as highlighted in orange colour in Fig. 1). Taking talent
recruiting as an example, disproportional recruitment rates for males and
females may be explainable by the fact that more males may have higher
education, and if males and females were nonetheless treated equally, this
would introduce reverse discrimination, which may be undesirable as well
[22]. In another example, on annual income analysis [2], males have a higher
annual income than females on average in the data. However, this does not
mean that there is discrimination against females in annual income, because
females work fewer hours per week than males on average. Therefore,
explaining the difference in annual income between males and females by
means of work hours per week helps to make the annual income outcomes
acceptable, legal and fair [22]. This shows that fairness and explanation are
tightly related to each other. It is therefore important to understand how AI
explanations affect fairness judgements and how AI fairness enhances AI
explanations. This paper investigates state-of-the-art research in these
areas and identifies key research challenges. The contributions of the paper
include:


– The relations between explainability and AI fairness are identified as one
of the significant components of the responsible use of AI and trustworthy
decision making.

– A systematic analysis of explainability and AI fairness is provided to
establish the current status of explainability for humans’ fairness
judgements.

– The challenges and future research directions regarding explainability for
AI fairness are identified.


2 Fairness 


Fairness has become a key element in developing socio-technical AI systems 
when AI is used in various decision making tasks. In the context of decision- 
making, fairness is defined as the absence of any prejudice or favoritism towards 
an individual or a group based on their inherent or acquired characteristics [27, 
33]. An unfair algorithm is one whose decisions are skewed toward a particular 
group. Fairness can be considered from at least four aspects [10]: 1) protected 
attributes such as race, gender, and their proxies, are not explicitly used to make 
decisions; 2) common measures of predictive performance (e.g., false positive and 
false negative rates) are equal across groups defined by the protected attributes; 
3) outcomes are independent of protected attributes; and 4) similarly risky
people are treated similarly.

There are two potential sources of unfairness in machine learning outcomes:
those arising from biases in data and those arising from algorithms. Mehrabi et 
al. [27] summarised 23 types of data biases that may result in fairness issues in 
machine learning: historical bias, representation bias, measurement bias, eval- 
uation bias, aggregation bias, population bias, Simpson’s paradox, longitudinal 
data fallacy, sampling bias, behavioural bias, content production bias, linking 
bias, temporal bias, popularity bias, algorithmic bias, user interaction bias, social 
bias, emergent bias, self-selection bias, omitted variable bias, cause-effect bias, 
observer bias, and funding bias. Different kinds of discrimination that may occur 
in algorithmic decision making are also categorised by Mehrabi et al. [27] such as 
direct discrimination, indirect discrimination, systemic discrimination, statisti- 
cal discrimination, explainable discrimination, and unexplainable discrimination. 
Different metrics have been developed to measure AI fairness quantitatively and 
various approaches have been proposed to mitigate AI biases [6]. For example, 
statistical parity difference is defined as the rate of favorable outcomes
received by the unprivileged group minus the rate received by the privileged
group, and equal opportunity difference is defined as the true positive rate of
the unprivileged group minus that of the privileged group. The true positive
rate is the ratio of true positives to the total number of actual positives for
a given group.
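
The two metrics defined above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in [6]: the function names, the toy labels, and the encoding of the groups (0 for unprivileged, 1 for privileged) and of the favorable outcome (1) are all assumptions made for the example.

```python
from typing import Sequence


def favorable_rate(y_pred: Sequence[int], group: Sequence[int], value: int) -> float:
    """Share of favorable predictions (encoded as 1) within one group."""
    members = [p for p, g in zip(y_pred, group) if g == value]
    return sum(members) / len(members)


def true_positive_rate(y_true: Sequence[int], y_pred: Sequence[int],
                       group: Sequence[int], value: int) -> float:
    """True positives divided by actual positives within one group."""
    preds_on_positives = [p for t, p, g in zip(y_true, y_pred, group)
                          if g == value and t == 1]
    return sum(preds_on_positives) / len(preds_on_positives)


def statistical_parity_difference(y_pred, group, unprivileged=0, privileged=1) -> float:
    """Favorable-outcome rate of the unprivileged group minus the privileged group."""
    return (favorable_rate(y_pred, group, unprivileged)
            - favorable_rate(y_pred, group, privileged))


def equal_opportunity_difference(y_true, y_pred, group,
                                 unprivileged=0, privileged=1) -> float:
    """True positive rate of the unprivileged group minus the privileged group."""
    return (true_positive_rate(y_true, y_pred, group, unprivileged)
            - true_positive_rate(y_true, y_pred, group, privileged))


# Hypothetical toy data: eight decisions, four per group.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
group = [0, 0, 0, 0, 1, 1, 1, 1]

spd = statistical_parity_difference(y_pred, group)
eod = equal_opportunity_difference(y_true, y_pred, group)
print(f"statistical parity difference: {spd:.3f}")
print(f"equal opportunity difference:  {eod:.3f}")
```

Under this sign convention, a value of zero indicates parity and a negative value indicates that the unprivileged group is disadvantaged. The sketch does not guard against a group with no members or no actual positives, which a real implementation would need to handle.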

Because fairness metrics are disconnected from the practical needs of society,
politics, and law [21], Lee et al. [24] argued that relevant contextual
information should be considered when assessing a model’s ethical impact, and
that fairness metrics should be framed within a broader view of ethical
concerns to ensure their adoption for a contextually appropriate assessment of
each algorithm.

As AI is often used by humans and/or for human-related decision making,
people’s perception of fairness must be taken into account when designing and
implementing AI-informed decision making systems [38]. Accordingly, people’s
perception of fairness has been investigated along four dimensions:
1) algorithmic predictors, 2) human predictors, 3) comparative effects
(human decision-making vs. algorithmic decision-making), and 4) consequences
of AI-informed decision making [38].


3 AI Explanation 


AI explainability has been reviewed thoroughly in recent years [7,44]. These
reviews organise methods by the explanation-generation approach, the type of
explanation, the scope of explanation, the type of model that can be explained,
or combinations of these and other criteria [1]. For example, explanation
methods can be
grouped into pre-model, in-model, and post-model methods by considering when 
explanations are applicable; there are also intrinsic and post-hoc explanation 
methods by considering whether explainability is achieved through constraints 
imposed on the AI model directly (intrinsic) or by applying explanation methods 
that analyse the model after training (post-hoc). Other types of explanations 
include model-specific and model-agnostic methods, as well as global and local 
explanation methods.

Miller [28] emphasised the importance of social science in AI explanations
and found that 1) Explanations are contrastive and people do not ask why an 
event happened, but rather why this event happened instead of another event; 
2) Explanations are selected in a biased manner. People are adept at selecting 
one or two causes from an infinite number of causes to be the explanation, which 
could be influenced by certain cognitive biases; 3) Probabilities probably don’t 
matter. Explanations with statistical generalisations are unsatisfying and the 
causal explanation for the generalisation itself is usually effective; 4) Explana- 
tions are social. They are a transfer of knowledge to people and act as part of 
a conversation or interaction with people. Therefore, explanations are not just
the presentation of associations and causes for predictions; they are also
contextual.

Wang et al. [39] highlighted three desirable properties that ideal AI expla- 
nations should satisfy: 1) improve people’s understanding of the AI model, 2) 
help people recognize the model uncertainty, and 3) support people’s calibrated 
trust in the model. Therefore, different approaches are investigated to evaluate 
whether and to what extent the offered explainability achieves the defined objec- 
tive [44]. Objective and subjective metrics are proposed to evaluate the quality of 
explanations, such as clarity, broadness, simplicity, completeness, and soundness 
of explanations, as well as user trust. For example, Schmidt and Biessmann [34]
presented a quantitative measure for the quality of explanation methods based
on the idea that faster and more accurate decisions indicate intuitive
understanding: the information transfer rate, which is based on the mutual
information between human decisions and model predictions. Schmidt and
Biessmann [34] also argued that a trust metric must capture cases in which
humans are too biased towards the decisions of an AI system and overly trust
the system, and presented a quantitative measure for trust that takes the
quality of the AI model into account (see Eq. 1).


$$T = \frac{MI_{\hat{Y}}}{MI_{Y}} \qquad (1)$$


where $T$ is the trust metric, $MI_{\hat{Y}}$ is the mutual information between
human decisions and model predictions, and $MI_{Y}$ is the mutual information
between human decisions and true labels.
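
As a rough illustration of Eq. 1, the following Python sketch estimates both mutual information terms from empirical counts over discrete decisions and takes their ratio. It is a minimal sketch under stated assumptions, not the estimator used in [34]: the function names, the plug-in estimate of mutual information, and the toy decision sequences are all hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in estimate of mutual information (in bits) between two
    equally long sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def trust_metric(human: Sequence[Hashable],
                 model_pred: Sequence[Hashable],
                 true_labels: Sequence[Hashable]) -> float:
    """Ratio of MI(human, model predictions) to MI(human, true labels)."""
    mi_pred = mutual_information(human, model_pred)
    mi_true = mutual_information(human, true_labels)
    if mi_true == 0.0:
        raise ValueError("human decisions carry no information about the true labels")
    return mi_pred / mi_true


# Hypothetical toy data: the human copies the model on every item,
# although the model is wrong on the last two items.
human = [1, 1, 0, 0, 1, 0, 1, 0]
model_pred = [1, 1, 0, 0, 1, 0, 1, 0]
true_labels = [1, 1, 0, 0, 1, 0, 0, 1]

t = trust_metric(human, model_pred, true_labels)
print(f"T = {t:.2f}")
```

In this toy case the human decisions are more informative about the model predictions than about the true labels, so the ratio exceeds 1. Under the reading given in the text, this is the kind of over-reliance on the AI system that a trust metric should expose.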

Despite the extensive investigation of AI explanations, they still face several
challenges [29]. For example, similar to AI models, uncertainty is inherently
associated with explanations because they are computed from training data or 
models. However, many AI explanation methods such as feature importance- 
based approaches provide explanations without quantifying the uncertainty of 
the explanation. Furthermore, AI explanations, which should ideally reflect the 
true causal relations [17], mostly reflect statistical correlation structures between 
features instead. 


4 Explanation for AI Fairness 


As discussed previously, fairness and explanation are strongly interdependent.
Deciding on an appropriate notion of fairness to impose on AI models, or
understanding whether a model is making fair decisions, requires extensive
contextual understanding and domain knowledge. Shin and Park [37] investigated
the role of Fairness, Accountability, and Transparency (FAT) in algorithmic
affordance. Their study showed that FAT issues are multi-functionally related,
and that user attitudes about FAT depend heavily on the context in which they
arise and on who is looking at them. It also showed that topics regarding FAT
are related and overlapping, making them difficult to distinguish or separate,
and it demonstrated the heuristic role of FAT through its fundamental links to
trust.


4.1 Explanation Guarantees Fairness 


Explaining the decision making is a way to gain insights and guarantee
fairness to all groups impacted by AI-related decisions [13]. Lee et al. [24]
argued that explanations may help identify potential variables that are driving
the unfair outcomes. It is unfair if decisions are made without explanations or
with unclear, untrusted, and unverifiable explanations [32]. For example, Begley
et al. [5] introduced explainability methods for fairness based on the Shapley
value framework for model explainability [25]. The proposed fairness
explanations attribute a model’s overall unfairness to individual input
features, even when the model does not operate on protected/sensitive
attributes directly.
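
The idea of attributing a model's overall unfairness to individual input features can be illustrated with a small exact Shapley computation. The sketch below is not the method of Begley et al. [5]; it is a simplified stand-in that assumes a hypothetical two-feature rule-based model, uses the statistical parity difference as the unfairness measure, and "removes" a feature by replacing it with its dataset mean. All names and data are invented for the example.

```python
from itertools import combinations
from math import factorial

# Hypothetical toy data: each row is (income, zip_risk). The protected
# attribute `group` is NOT a model input, but zip_risk acts as a proxy for it.
rows = [(5.0, 1.0), (6.0, 1.0), (4.0, 1.0), (7.0, 0.0), (6.0, 0.0), (8.0, 0.0)]
group = [0, 0, 0, 1, 1, 1]  # 0 = unprivileged, 1 = privileged
feature_names = ["income", "zip_risk"]


def model(row):
    """Hypothetical decision rule: favorable outcome (1) if the score is high enough."""
    income, zip_risk = row
    return 1 if income - 2.0 * zip_risk >= 4.5 else 0


# Per-feature dataset means, used to mask features outside a coalition.
means = [sum(col) / len(rows) for col in zip(*rows)]


def parity_gap(coalition):
    """Statistical parity difference when only features in `coalition` are
    visible to the model; all other features are set to their mean."""
    preds = []
    for row in rows:
        masked = tuple(row[j] if j in coalition else means[j] for j in range(len(row)))
        preds.append(model(masked))

    def rate(g):
        return sum(p for p, gg in zip(preds, group) if gg == g) / group.count(g)

    return rate(0) - rate(1)


def shapley_unfairness():
    """Exact Shapley attribution of the parity gap to each input feature."""
    n = len(feature_names)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_i = parity_gap(set(subset) | {i})
                without_i = parity_gap(set(subset))
                phi[i] += weight * (with_i - without_i)
    return dict(zip(feature_names, phi))


attribution = shapley_unfairness()
for name, value in attribution.items():
    print(f"{name}: {value:+.3f}")
```

In this toy setting the attributions sum to the parity gap of the full model, and most of the gap is assigned to the proxy feature even though the protected attribute never enters the model. The exact enumeration is exponential in the number of features, so it is only practical for very small feature sets.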

Warner and Sloan [40] argued that effective regulation to ensure fairness
requires that AI systems be transparent, and explainability is one approach to
achieving transparency. Explainability requires that an AI system provide a
human-understandable explanation of why any given decision was reached in terms
of the training data used, the kind of decision function, and the particular
inputs for that decision. Different proxy variables of fairness are presented
for the effective regulation of AI transparency in [40].


4.2 Influence of Explanation on Perception of Fairness 


Baleis et al. [3] showed that transparency, trust and individual moral con- 
cepts demonstrably have an influence on the individual perception of fairness 
in AI applications. Dodge et al. [12] investigated the impact of four types of
AI explanations on people’s fairness judgments of AI systems: input
influence-based, demographic-based, sensitivity-based, and case-based
explanations. The study showed that case-based explanations are generally
perceived as less fair. It also found that local explanations are more
effective than global explanations for case-specific fairness issues, and that
sensitivity-based explanations are the most effective for the fairness issue of
disparate impact.


4.3 Fairness and Properties of Features 


Grgic-Hlaca et al. [14] sought to understand why people perceive certain
features as fair or unfair to use in algorithms, based on a case study of a
criminal risk estimation tool used to help make judicial decisions. Eight
properties of features are identified: reliability, relevance, volitionality,
privacy, causes outcome, causes vicious cycle, causes disparity in outcomes,
and caused by sensitive group membership. It was found that people’s concerns
about the unfairness of an input feature are not limited to discrimination, but
also involve latent properties such as the relevance of the feature to the
decision making scenario and the reliability with which the feature can be
assessed. In a further study, Grgic-Hlaca et al. [15] proposed measures for
procedural fairness (the fairness of the decision making process) that consider
the input features used in the decision process in the context of criminal
recidivism. The analysis examined to what extent the perceived fairness of a
feature is influenced by additional knowledge about whether it increases the
accuracy of the prediction. It was found that input features that improved the
accuracy of prediction were judged as fairer, and features that led to
discrimination against certain groups of people were judged as more unfair.


4.4 Fairness and Counterfactuals 


The use of counterfactuals has become one of the popular approaches for AI
explanation and for making sense of algorithmic fairness [4,26,44]. However,
this use can require an incoherent theory of what social categories are [23]:
it was argued that social categories may not admit counterfactual manipulation,
and hence may not appropriately satisfy the demands for evaluating the truth or
falsity of counterfactuals [23], which can lead to misleading results.
Therefore, the approaches used for algorithmic explanations to make sense of
fairness also need to consider social science theories to support AI fairness
and explanations.

A good example of the use of counterfactuals [18] is algorithmic risk assess- 
ment [11]. Algorithmic risk assessments are increasingly being used to help 
experts make decisions, for example, in medicine, in agriculture or criminal jus- 
tice. The primary purpose of such AI-based risk assessment tools is to provide 
decision-relevant information for actions such as medical treatments, irrigation 
measures or release conditions, with the aim of reducing the likelihood of the 
respective adverse event such as hospital readmission, crop drying, or criminal 
recidivism. Here, the advantage of the principle of machine learning, namely
learning from large amounts of historical data, is precisely counterproductive,
even dangerous [19], because such algorithms reflect the risk under the
decision-making policies of the past, not the current actual conditions. To
cope with this problem, [11] presents a new method for estimating the proposed
metrics that uses doubly robust estimation, and shows that only under strict
conditions can fairness be provided simultaneously according to the standard
metric and the counterfactual metric. Consequently, fairness-enhancing methods
that aim for parity in a standard fairness metric can cause greater imbalance
in the counterfactual analogue.


5 Discussion


With the increasing use of AI in people’s daily lives for various decision
making tasks, the fairness of AI-informed decisions and the explanation of AI
for decision making are becoming significant concerns for the responsible use
of AI and trustworthy decision making. This paper focused on the relations
between explanation and AI fairness, and especially on the roles of explanation
in AI fairness. The investigations demonstrated that fair decision making
requires extensive contextual understanding. AI explanations help identify
potential variables that are driving unfair outcomes. Different types of AI
explanations affect people’s fairness judgements differently. Certain
properties of features, such as the relevance of the feature to the decision
making scenario and the reliability with which the feature can be assessed,
affect people’s fairness judgements. In addition, social science theories need
to be considered in making sense of fairness with explanations. However, there
are still challenges. For example,


— Although fair decision making requires extensive contextual understanding,
it is hard to decide what contextual understanding is appropriate to support
fair decision making.

— There are various types of explanations. It is important to decide which
explanations promote people’s fairness judgements of decision making as
expected. Because fairness judgements depend strongly on the users themselves,
it is a challenge to justify which explanations are best for people’s fairness
judgements.

— Since AI is applied in various sectors and scenarios, it is important to
understand whether different application sectors or scenarios affect the
effectiveness of explanations on people’s perception of fairness in decision
making.


Investigating AI fairness explanations requires a multidisciplinary approach 
and must include research on machine learning [9], human-computer interaction 
[31] and social science [30], regardless of the application domain, because the
domain expert must always be involved and can bring valuable knowledge and
contextual understanding [16].

All this provides us with clues for developing effective approaches to respon- 
sible AI and trustworthy decision-making in all future work processes. 


6 Conclusion 


The importance of fairness is undisputed. In this paper, we have explored the 
relationships between explainability, or rather explanation, and AI fairness, and 
in particular the role of explanation in AI fairness. We first identified the rela- 
tionships between explanation and AI fairness as one of the most important 
components for the responsible use of AI and trustworthy decision-making. The 
systematic analysis of explainability and AI fairness revealed that fair
decision-making requires a comprehensive contextual understanding, to which AI
explanations can contribute. Based on our investigation, we were able to
identify several other challenges regarding the relationships between
explainability and AI fairness. We ultimately argue that the study of AI
fairness explanations requires a multidisciplinary approach, which is necessary
for a responsible use of AI and for trustworthy decision-making, regardless of
the application domain.

Abstract. This paper reviews logical approaches and challenges raised
for explaining AI. We discuss the issues of presenting explanations as 
accurate computational models that users cannot understand or use. 
Then, we introduce pragmatic approaches that consider explanation a 
sort of speech act that commits to felicity conditions, including
intelligibility, trustworthiness, and usefulness to the users. We argue that
Explainable AI (XAI) is more than a matter of accurate and complete
computational explanation, and that it requires pragmatics to address the
issues it seeks to address. At the end of this paper, we draw a historical
analogy to usability, a term that was understood logically and pragmatically
but has evolved empirically through time to become more prosperous and more
functional.


Keywords: Explainable AI · Pragmatics · Conversation · Causability


1 Introduction 


Artificial intelligence (AI) technology has advanced many human-facing
applications in our daily lives. As one of the most widely used AI-driven
intelligent systems, recommendation systems have become an essential part of
today’s digital ecosystems. For example, recommendation systems have been
widely adopted for suggesting relevant items or people to users on social media
[8]. Billions of people interact with these AI systems every day. Effective
recommender systems typically exploit multiple data sources and ensemble
intelligent inference methods, e.g., machine learning or data science
approaches. However, it is usually difficult for end-users to comprehend the
internal processes by which a recommendation was made. The reasons for
receiving specific recommendations usually stay in a black box, which
frequently makes the resulting recommendations less trustworthy to the users
[1]. The users generally have little understanding of the mechanism behind
these systems, so the recommendations are not yet transparent to them. Opaque
designs are known to negatively affect users’ satisfaction and impair their
trust in recommendation systems [25]. Moreover, in this situation, processing
this output could produce user behavior that is confusing, frustrating, or even
dangerous in life-changing scenarios [1].

We argue that providing explainable recommendation models and interfaces may
not ensure that users will understand the underlying rationale, data, and logic
[26]. Scientific explanations, which are based on accurate AI models, might
not be comprehensible to users who lack AI literacy. For instance, a software
engineer would appreciate inspecting the approximated probability in a
recommendation model. However, this information could be less meaningful or
even overwhelming to lay users with varied computational knowledge, beliefs,
and even biases [2]. We believe the nature of an explanation is to help users
understand and build a working mental model of using AI applications in
everyday life [5]. We urgently need more work on empowering lay users by
providing comprehensible explanations in AI applications, so that they can
benefit from daily collaboration with AI.

In this paper, we review logical approaches to Explainable AI (XAI). We review
the logic of explanation and the challenges raised for explaining AI using
generic algorithms. Specifically, we are interested in the issues of presenting
such explanations to users, for instance, explaining accurate system models
that users cannot understand or use. Then, we discuss pragmatic approaches that
consider explanation a sort of speech act that commits to felicity conditions,
including intelligibility, trustworthiness, and usefulness to the listener. We
argue that XAI is more than a matter of accurate and complete explanation, and
that it requires a pragmatics of explanation to address the issues it seeks to
address. We then draw a historical analogy to usability, a term that was
understood logically and pragmatically but has evolved empirically through time
to become more prosperous and functional.


2 The Logic of Explanations 


Explainable AI (XAI) has drawn more and more attention in the broader field
of human-computer interaction (HCI) due to its extensive social impact. With
the popularity of AI-powered systems, it is imperative to provide users with
effective and practical transparency. For instance, the newly initiated
European Union’s General Data Protection Regulation (GDPR) requires the owner
of any data-driven application to maintain a “right to the explanation” of
algorithmic decisions [7]. Enhancing transparency in AI systems has been
studied in XAI research to improve AI systems’ explainability,
interpretability, or controllability [14,16]. Researchers have explored a range
of user interfaces and explainable models to support exploring, understanding,
explaining, and controlling recommendations [10,25,26]. In many user-centered
evaluations, these explanations positively contribute to the user experience,
i.e., trust, understandability, and satisfaction [25]. Self-explainable
recommender systems have been shown to increase user perception of system
transparency and acceptance of the system suggestions [14]. These explanations
were usually post-hoc and one-shot, which leaves the obvious challenge of when,
why, and how to explain the system to the users based on their information
needs and beliefs.

Another stream of research has identified the effects of making the
recommendation process more transparent. Transparency could improve the user’s
conceptual model by enhancing the recommendation system’s controllability
[22,26]. In these attempts, users were allowed to influence the presented
recommendations by interacting with different visual interfaces. The
interactive recommender systems demonstrated that users appreciate
controllability in their interactions with the recommender systems [14].
Similar effects applied to visualizations that let users understand how their
actions can impact the system, which contributes to the overall inspectability
[14] and causability [13] of the recommendation process. A transparent
recommendation process could accelerate the information-seeking process but
does not guarantee comprehension of the target system’s inner logic. These
solutions empowered the user to control the system to access the desired
recommendations. However, these controllable interfaces may not fulfill the
explanation needs or help the users build a mental model of how the system
works.

The user’s mental model represents the knowledge of information systems
generated and evolved through interaction with the system [18]. The idea
was founded in cognitive science and the HCI discipline in the 1980s. For
instance, Norman [21] argued that the user could invent a mental model to
simulate system behavior and make assumptions or predictions about the
interaction outcome based on a target system. Following Norman’s definition,
the user’s mental models are constructed, incomplete, limited, unstable, and
sometimes “superstitious” [21]. The user’s mental model interacts with the
conceptual model that the system designer used to develop the system. HCI
researchers have considered the user’s mental model in designing usable systems
or interfaces in the past two decades. However, only a few studies have
examined the user’s mental model in the context of AI-powered recommender
systems and algorithmic decisions [20].

We argue that these controllable and explainable user interfaces may not
always ensure that users understand the underlying rationale of each
contributing data source or method [26]. The users could perceive the system’s
usefulness but still lack the predictability or causability [13] needed to
approximate the behavior of the target system [21]. In our observation, users
could build different mental models while interacting with an explainable
system. For instance, users with more robust domain knowledge, such as trained
computer science students, would be more judgmental in using the explainable
system because of their computational knowledge. In contrast, naive users would
be more willing to accept and trust the recommendations [26]. We also observe
that controllable interfaces would lead the user to compare the recommendations
in their decision-making process. Still, this does not mean the users could
understand or predict the system’s underlying logic. These findings demonstrate
that personal factors and mental models (such as education, domain experience,
and familiarity with technology) could significantly affect the user’s
perception of the system and cognitive processing of machine-generated
explanations.


3 The Pragmatics of Explanations


Miller [18] and Mittelstadt et al. [19] suggest that AI and HCI researchers
need to differentiate scientific and everyday explanations. To provide
everyday explanations, researchers need to consider cross-discipline knowledge
(e.g., HCI, social science, cognitive science, psychology, etc.) and the user’s
mental model, instead of relying on the scientific intuition to provide
prediction approximations (e.g., the global or local surrogate XAI models). For
example, as HCI researchers, we already know that a successful explanation
should be iterative, sound, complete, and not overwhelm the user. Social
science researchers defined the everyday explanation through three principles
[24]. 1) Human explanations are contrastive: perceiving abnormality plays an
important role in seeking an explanation, i.e., users would be more likely to
seek an explanation for an unexpected recommendation [18]. 2) Human
explanations are selective: users may not seek a “complete cause” of an event;
instead, they tend to seek useful information in the given context. This
selectivity could reduce the effort of following long causal chains and the
cognitive load of processing the countless parameters of modern AI models.
3) Human explanations are social: the process of seeking an explanation should
be interactive, such as a conversation. The explainer and the explainee can
engage in information transfer through dialogue or other means [12].

Specifically, we propose to explore the pragmatics of explanations in AI, i.e.,
the mechanism by which the user requests an explanation from AI applications.
The HCI community has long been interested in the interaction benefits of
conversational interfaces. The design space could be situated within a rich
body of studies on conversational agents or chatbot applications, e.g.,
AI-driven personal assistants [17]. The design of conversational agents offers
several advantages over traditional WIMP (Windows, Icons, Menus, and Pointers)
interfaces. The interface could provide a natural and familiar way for users to
tell the system about themselves, which improves the system’s usability and
updates the user’s mental model of the system. Moreover, the design is flexible
(like a dialogue) and can accommodate diverse user requests without requiring
users to follow a fixed path (e.g., the controllable interfaces [26]). The
interaction could be augmented by a personified persona, in which
anthropomorphic features could help attract user attention and gain user trust.

In this section, we present two case studies to introduce our early
investigation into the pragmatics of AI explanations.


3.1 Case 1: Conversational Explanations 


Online symptom checkers (OSCs) are intelligent systems using machine learning
approaches (e.g., clinical decision trees) to help patients with self-diagnosis
or self-triage [27]. These systems have been widely used in various health
contexts, e.g., patients could use OSCs to check their early symptoms. Patients
could learn about their symptoms before a doctor visit, identify the
appropriate care level and services, and determine whether they need medical
attention from healthcare providers [23]. AI-powered symptom checkers promise
various benefits, such as providing quality diagnosis and reducing unnecessary
visits and tests. However, unlike real healthcare professionals, most OSCs do
not explain why they provide a given diagnosis or why a patient falls into a
disease classification. OSCs’ data and clinical decision models are usually
neither transparent nor comprehensible to lay users.

[Figure 1: screenshot of a chatbot conversation in which the symptom checker
asks about risk factors, recommends calling a medical provider within 24 hours,
and explains that the symptoms may be related to COVID-19.]

Fig. 1. Example of the conversational AI explanations [27]

We argue that explanations could be used to promote the diagnostic transparency
of online symptom checkers in a conversational manner. First, we conducted
an interview study to explore what explanation needs exist in the current use
of OSCs. Second, informed by the first study’s results, we used a design study
to investigate how explanations affect the user perception and user experience
with OSCs. We designed a COVID-19 OSC (shown in Fig. 1) and tested it
with three styles of explanations in a lab-controlled study with 20 subjects. We
found that conversational explanations can significantly improve overall user
experiences of trust, transparency perception, and learning. In addition, we
showed that by interweaving explanations into the conversation flow, OSCs could
facilitate users’ comprehension of the diagnostics in a dynamic and timely
fashion.

First, the findings contributed empirical insights into user experiences with
explanations in healthcare AI applications. Second, we derived conceptual
insights into OSC transparency. Third, we proposed design implications for
improving transparency in healthcare technologies, and especially explanation
design in conversational agents.

[Figure 2: screenshot of a social media comment thread in which a health
chatbot recommends cycling to a user and explains the recommendation.]

Fig. 2. Example of the Explainable AI-Mediated communication.


3.2 Case 2: Explainable AI-Mediated Communication (XAI-MC) 


Computer-mediated communication is an integral part of modern health promotion
initiatives for non-collocated members [11]. The concept has been extensively
adopted as an interpersonal communication medium in public health research such
as telemedicine and mental health support. Today, in Artificial
Intelligence-Mediated Communication (AI-MC), communication between people can
be augmented by computational agents to achieve different communication goals
[9]. For instance, interpersonal text-based communication (e.g., email) could
be augmented by auto-correct, auto-completion, or auto-response. AI-MC has
received more and more attention in recent socially efficacious research. For
example, an AI agent could undermine the writer’s message by altering negative
keywords (e.g., “sorry”) to encourage the user to normalize language as the
right way of speaking. An AI agent could also mitigate interpersonal biases by
triggering alert messages when it detects that users intend to post negative
messages on social media [15]. The introduction of AI brings new opportunities
to adopt computational agents in family health collaboration and
communication. AI-MC could be used to better engage family members in health
conversations, and the communication may translate into healthy behavioral
changes [6]. We can introduce a designated agent to mediate the communication
by recommending and explaining health information to the family members.
Little attention has been paid to the question of how computational agents
ought to disclose information to users in AI-MC and the effects on family
health promotion.

We explored the effects of promoting non-collocated family members’ healthy
lifestyle through Explainable AI-Mediated Communication (XAI-MC). We
examined how XAI-MC would help non-collocated family members to engage
in conversations about health, to learn more about each other’s healthy
practices, and as a result to encourage family collaboration via an online
platform. We are particularly interested in exploring the effect of bringing
transparent AI agents to family communication. Specifically, we proposed to
design a transparent AI agent to mediate the non-collocated family members’
communication on healthy lifestyles. In our design, the users could share
information on healthy activities to enhance family health awareness and
engagement in a social media application. In the platform (shown in Fig. 2), a
designated AI-powered health chatbot was used to mediate family members’
communication on social media by explaining the health recommendations to them.
We adopted the explainable health recommendations to address existing
challenges related to remote family collaboration on health through XAI-MC. The
findings could help to generate insights into designing transparent AI agents
to support collaborating and sharing health and well-being information through
online conversation.

We conducted a week-long field study with 26 participants who each had at least
one non-collocated family member or friend willing to join the study with them.
Following a within-subjects design, participants took part in two study phases:
1) AI-MC with non-explainable health recommendations and 2) XAI-MC with
explainable health recommendations. We adopted a mixed-methods approach to
evaluate our design by collecting quantitative and qualitative feedback. We
found evidence that providing transparent AI agents helped individuals gain
health awareness, engage in conversations about healthy living practices, and
promote collaboration among family members. Our findings provide insights into
developing effective family-centered health interventions that aid
non-collocated families in cultivating health together. The experimental
results help to explain how transparent AI agents could mediate health
conversation and collaboration within non-collocated families.


4 Usability, Explainability and Causability


The two case studies present our preliminary findings in support of our argument
that XAI is more than a matter of accurate explainable or interpretable models.
Here we would like to draw a historical analogy to usability. One tension
in contemporary AI is the perception that core system qualities like speed,
efficiency, accuracy and reliability might be compromised by pursuing objectives
like transparency and accountability for some form of diffuse explanatory value [9].
Although our understanding of qualities like transparency and accountability
is limited, this limitation can be addressed directly to enhance causability.

The trend toward Explainable AI can be seen as analogous to usability: merely
simplifying a user interface (in a logical/formal sense) may or may not make
it more usable. Instead, the key to usability is a set of pragmatic conditions:
the interface must be satisfying, challenging, informative, intuitive, etc. We
can conclude that XAI is more than a matter of accurate and complete
explanation; it requires a pragmatics of explanation in order to address the
issues it seeks to address. One specific issue in XAI is that AI should be able
to explain how it is fair. Such an explanation will necessarily intersect with
an accurate system model but would be much more focused on interaction
scenarios and user experiences.

Historically, in the 1980s, usability was often noted only in simple terms, as
if it meant no more than “keep it simple, stupid.” Directly pursuing the simple
solution is not the same as usability. Likewise, users’ ability to trust or
understand AI is not sufficient. We do not really have a theory of usability in
these respects. Usability is not simply equated with empirical evidence; it
calls for experimenting with different kinds of explanations and exploratory
interaction and exploring the consequences. The consequences could themselves
be part of usability. When something goes wrong, users need an explanation,
i.e., they want to know what is happening. Explanations could be a form of
engagement, involving active thinking and active learning, user interaction and
usability, and situations in which something goes wrong. We try to understand
the system model, but why do users want to get this explanation?

Carroll and Aaronson [3] investigated a Wizard of Oz simulation of intelligent 
help. They studied interactions with a popular database application and identi- 
fied 13 critical user errors, including the application state people were in when 
they made these errors. In this way, the help simulation recognized and guided 
recovery from a set of serious mistakes. Carroll and Aaronson designed two kinds 
of helpful information: “how-it-works,” explaining how the system model worked 
to allow the error and leaving it to the user to discover what to do, and “how 
to do it,” describing procedures the user should follow to recover from the error 
and continue their task. They found that people often preferred help messages 
explaining how the database application worked, for example, when it noted the 
distinction between forms and data when users entered both field labels and 
numeric values. Such interactions were satisfying to users who were puzzled by
the system, but “how-it-works” messages particularly pleased users by answering
questions just as they were being formulated. Simplifying a user interface in a
logical/formal sense may or may not make it more usable. Usability also
depends on pragmatic conditions: systems must be satisfying, challenging,
informative, intuitive, etc.

The field of human-computer interaction (HCI) coalesced around the concept 
of usability in the early 1980s, but not because the idea was already defined 
clearly, or could be predictably achieved in system design. It was instead because 
the diffuse and emerging concept of usability evoked a considerable amount of
productive inquiry into the nature and consequences of usability, fundamentally
changing how technology developers, users, and everyone else thought about what
using a computer could and should be [4]. Suppose that AI technologies were
correctly reconceptualized to include the capability to effectively explain what
they are doing, how they are doing it, and what courses of action they are
considering. Effective, in this context, would mean codifying and reporting on
plans and activities in a way that is intelligible to humans. The standard would
not be a superficial Turing-style simulacrum but a depth-oriented investigation 
of human-computer interaction to fundamentally advance our understanding of 
accountability and transparency. We have already seen how such a program of 
inquiry can transform computing. 


References 


1. Amershi, S., et al.: Guidelines for human-AI interaction. In: Proceedings of the 2019
CHI Conference on Human Factors in Computing Systems, p. 3. ACM (2019) 

2. Anderson, A., et al.: Mental models of mere mortals with explanations of reinforce- 
ment learning. ACM Trans. Interact. Intell. Syst. (TiiS) 10(2), 1-37 (2020) 

3. Carroll, J., Aaronson, A.: Learning by doing with simulated intelligent help. Com- 
mun. ACM 31(9), 1064-1079 (1988) 

4. Carroll, J.M.: Beyond fun. Interactions 11(5), 38-40 (2004) 

5. Craik, K.J.W.: The Nature of Explanation, vol. 445. CUP Archive, Cambridge 
(1952) 

6. Dragoni, M., Donadello, I., Eccher, C.: Explainable AI meets persuasiveness: trans- 
lating reasoning results into behavioral change advice. Artif. Intell. Med. 105, 
101840 (2020) 

7. Eiband, M., Schneider, H., Bilandzic, M., Fazekas-Con, J., Haug, M., Hussmann, 
H.: Bringing transparency design into practice. In: 23rd International Conference 
on Intelligent User Interfaces, pp. 211-223. ACM (2018) 

8. Guy, I.: Social recommender systems. In: Ricci, F., Rokach, L., Shapira, B. (eds.) 
Recommender Systems Handbook, pp. 511-543. Springer, Boston, MA (2015). 
https://doi.org/10.1007/978-1-4899-7637-6_15

9. Hancock, J.T., Naaman, M., Levy, K.: AI-mediated communication: definition,
research agenda, and ethical considerations. J. Comput. Mediat. Commun. 25(1), 
89-100 (2020) 

10. Herlocker, J.L., Konstan, J.A., Riedl, J.: Explaining collaborative filtering recom- 
mendations. In: Proceedings of the 2000 ACM Conference on Computer Supported 
Cooperative Work, pp. 241-250. ACM (2000) 

11. Herring, S.C.: Computer-mediated communication on the internet. Ann. Rev. Inf. 
Sci. Technol. 36(1), 109-168 (2002) 

12. Hilton, D.J.: Conversational processes and causal explanation. Psychol. Bull. 
107(1), 65 (1990) 

13. Holzinger, A., Carrington, A., Müller, H.: Measuring the quality of explanations:
the system causability scale (SCS). KI-Künstliche Intelligenz 34(2), 193-198 (2020)

14. Knijnenburg, B.P., Bostandjiev, S., O’Donovan, J., Kobsa, A.: Inspectability and 
control in social recommenders. In: Proceedings of the Sixth ACM Conference on 
Recommender Systems, pp. 43-50. ACM (2012) 

15. Levy, K., Barocas, S.: Designing against discrimination in online markets. Berkeley 
Technol. Law J. 32(3), 1183-1238 (2017) 


16. Liao, Q.V., Gruen, D., Miller, S.: Questioning the AI: informing design practices 
for explainable AI user experiences. In: Proceedings of the 2020 CHI Conference 
on Human Factors in Computing Systems, pp. 1-15 (2020) 

17. Liao, Q.V., et al.: All work and no play? In: Proceedings of the 2018 CHI Conference 
on Human Factors in Computing Systems, pp. 1-13 (2018) 

18. Miller, T.: Explanation in artificial intelligence: insights from the social sciences.
Artif. Intell. 267, 1-38 (2019)

19. Mittelstadt, B., Russell, C., Wachter, S.: Explaining explanations in AI. In:
Proceedings of the Conference on Fairness, Accountability, and Transparency,
pp. 279-288 (2019)

20. Ngo, T., Kunkel, J., Ziegler, J.: Exploring mental models for transparent and
controllable recommender systems: A qualitative study. In: Proceedings of the 28th
ACM Conference on User Modeling, Adaptation and Personalization, pp. 183-191
(2020)

21. Norman, D.A.: Some observations on mental models. Ment. Models 7(112), 7-14
(1983)

22. O’Donovan, J., Smyth, B., Gretarsson, B., Bostandjiev, S., Höllerer, T.: Peer- 
chooser: visual interactive recommendation. In: Proceedings of the SIGCHI Con- 
ference on Human Factors in Computing Systems, pp. 1085-1088. ACM (2008) 

23. Powley, L., McIlroy, G., Simons, G., Raza, K.: Are online symptoms checkers useful 
for patients with inflammatory arthritis? BMC Musculoskelet. Disord. 17(1), 362 
(2016) 

24. Ruben, D.H.: Explaining Explanation. Routledge, London (2015) 

25. Tintarev, N., Masthoff, J.: Explaining recommendations: design and evaluation. 
In: Ricci, F., Rokach, L., Shapira, B. (eds.) Recommender Systems Handbook, pp. 
353-382. Springer, Boston, MA (2015). https://doi.org/10.1007/978-1-4899-7637-6_10

26. Tsai, C.-H., Brusilovsky, P.: The effects of controllability and explainability in 
a social recommender system. User Model. User-Adapt. Interact. 31(3), 591-627 
(2020) 

27. Tsai, C.H., You, Y., Gui, X., Kou, Y., Carroll, J.M.: Exploring and promoting 
diagnostic transparency and explainability in online symptom checkers. In: Pro- 
ceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 
pp. 1-17 (2021) 


Open Access This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), 
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 

The images or other third party material in this chapter are included in the 
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the 
material. If material is not included in the chapter’s Creative Commons license and 
your intended use is not permitted by statutory regulation or exceeds the permitted 
use, you will need to obtain permission directly from the copyright holder. 


Author Index


Akata, Zeynep 69
Arjona-Medina, Jose A. 177

Bargal, Sarah Adel 255
Barnes, Elizabeth A. 315
Bastani, Osbert 207
Becking, Daniel 271
Biecek, Przemyslaw 13
Bischl, Bernd 39
Blies, Patrick M. 177
Brandstetter, Johannes 177
Bruna, Joan 91

Carroll, John M. 387
Casalicchio, Giuseppe 39
Cheeseman, Ted 297
Chen, Fang 375

Dandl, Susanne 39
Dinu, Marius-Constantin 177
Dorfer, Matthias 177
Dreyer, Maximilian 271

Ebert-Uphoff, Imme 315

Fong, Ruth 3
Freiesleben, Timo 39

Goebel, Randy 3
Grosse-Wentrup, Moritz 39

Ha, Wooseok 229
Hacker, Philipp 343
Herbinger, Julia 39
Hochreiter, Sepp 177
Hofmarcher, Markus 177
Holzinger, Andreas 3, 13, 375

Inala, Jeevana Priya 207

Karimi, Amir-Hossein 139
Kauffmann, Jacob 117
Kierdorf, Jana 297
Koepke, A. Sophia 69
Kolek, Stefan 91
König, Gunnar 39
Kutyniok, Gitta 91

Lapuschkin, Sebastian 271
Lensch, Hendrik P. A. 69
Levie, Ron 91

Mamalakis, Antonios 315
Marcos, Diego 297
Molnar, Christoph 13, 39
Montavon, Grégoire 117
Moon, Taesup 3
Müller, Karsten 271
Müller, Klaus-Robert 3, 117
Murino, Vittorio 255

Nguyen, Duc Anh 91

Passoth, Jan-Hendrik 343
Patil, Vihang P. 177
Petsiuk, Vitali 255

Roscher, Ribana 297

Saenko, Kate 255
Salewski, Leonard 69
Samek, Wojciech 3, 13, 117, 271
Saranti, Anna 13
Scholbeck, Christian A. 39
Schölkopf, Bernhard 139
Sclaroff, Stan 255
Singh, Chandan 229
Solar-Lezama, Armando 207

Tsai, Chun-Hua 387
Tuia, Devis 297

Valera, Isabel 139
von Kügelgen, Julius 139

Yu, Bin 229

Zhang, Jianming 255
Zhou, Bolei 167
Zhou, Jianlong 375
Zunino, Andrea 255


